Foreword


After stating that Bayesian methods are now becoming widely accepted as a way to solve applied 
statistical problems, Professor Bansal writes in the preface of his book that his book “...is an attempt to 
bridge the gap between existing advanced and elementary texts on the subject...”. There is no question 
but that Professor Bansal has identified a great need and has filled it with a book that reflects his deep 
understanding of the Bayesian approach and ability to explain and illustrate its principles and applications 
in a clear, understandable and operational manner. Readers of the book will not only obtain philosophical 
understanding of the Bayesian approach but also become familiar with its operations and applications of 
them in analyses of central statistical estimation, testing, prediction and control problems relating to 
many statistical models that are often employed in practice. That he was able to combine theoretical and 
operational aspects of Bayesian analysis so well in his book is a remarkable achievement that reflects 
many years of productive theoretical and applied research by him and his students on many Bayesian 
theoretical and applied problems and his many years of successful lecturing. 

Along with this fine coverage of Bayesian theoretical and applied principles based on relatively 
simple mathematical requirements, there is an important set of about 500 problems with solutions provided, 
many references and a glossary of Bayesian terms. These items, combined with the exposition of basic 
theoretical principles, should make the book extremely valuable not only to students but also to many 
researchers, decision-makers and others who wish to obtain a good appreciation of and introduction to 
the Bayesian approach to statistical inference and decision-making. 

Thanks to Professor Bansal for his fine contribution to the literature on Bayesian analysis that 
will be of great value to students and others in their quest to obtain better solutions to statistical 
inference and decision problems. 


Arnold Zellner 


University of Chicago 
Chicago, USA 


Preface 


The purpose of statistical analysis is to procure the causes in the form of model parameters from the 
effects summarized by observations. In the 18th century, Rev. Thomas Bayes and Pierre Simon Laplace
argued that causes and effects should be put on the same conceptual level by treating observations and 
parameters as random variables. 

The implementation of the Bayesian paradigm depends on assigning probability distributions not 
only to observable data variables but also to unknown parameters. The introduction of prior information, 
based on past studies or opinions of subject area experts, about the possible values of parameters is the 
distinctive feature of the Bayesian approach. The celebrated Bayes theorem provides a formal rule to 
combine information contained in a prior distribution with the sample information contained in the likelihood 
function to give posterior distribution that contains all the probabilistic information about the parameters. 
Probabilizing uncertainty allows a Bayesian to make direct probability statements about the values of the 
parameters and future values of as yet unobserved outcomes of the experiment. 

The Bayesian approach to parametric inference refers to prior, posterior and predictive distributions 
to obtain estimates, compare models and test hypotheses, and make predictions conditional on an observed 
sample. Finite sample results as well as excellent asymptotic results are not difficult to derive. A Bayesian 
just learns how to apply Bayes theorem instead of learning a large number of ‘ad-hoc’ frequentist inferential 
techniques of the Neyman, Pearson and Fisherian era. 

Bayesian methods are now becoming widely accepted as a way to solve applied statistical problems 
in industries and government. Research groups in various disciplines like econometrics, education, law, 
archaeology, engineering, medical and life sciences are using Bayesian inferential methods to obtain 
optimum solutions to their problems. 

A number of books on Bayesian methodology and inference are available at various levels of 
mathematical sophistication or practical applications besides professional literature in the form of 
monographs and research papers. The aim of Bayesian Parametric Inference is modest. It is an attempt to 
bridge the gap between existing advanced and elementary texts on the subject, while being mindful of 
Arnold Zellner’s advice: “Keep it sophisticatedly simple”. 

The prerequisites are, therefore, knowledge of calculus and elementary matrix algebra, undergraduate 
level of probability, univariate and multivariate distributions, and statistical inference. 

The book consists of eleven chapters. The first two chapters provide relevant definitions, results 
of the calculus of probability, and standard distributions. Readers for whom it is insufficient may consult 
any standard text on probability and statistical methods. 

James Bernoulli and Abraham de Moivre knew the Bayes theorem as a formula for computing 
conditional probability. Bayes reinterpreted it as a formula to update prior probabilities of a hypothesis 
about the parameter of the model in the light of observed data. Chapter 3 is devoted to various 
interpretations of Bayes theorem and its applications. Readers will also find Zellner’s recent proof of 
Bayes theorem using Information Conservation Principle in this chapter. 

Prior distribution on parameters is a crucial input in any Bayesian statistical analysis. The next two
chapters are devoted to construction of conjugate and non-informative prior distributions. The information
theoretic approaches to obtaining Jaynes’ Maximum Entropy and Zellner’s Maximum Data Information prior
distributions are also discussed in Chapter 5.

In Chapter 6 the decision theoretic approach is used to discuss Bayes estimation of the parameter. 
A variety of loss functions, including Varian’s linex and Zellner’s balanced loss functions, are introduced 
to obtain the Bayes estimate. Duality between loss and prior, and a criterion for choosing a weight 
function to construct a weighted loss function are explained with the help of examples. 

Bayes factors are often used to compare statistical hypotheses and models. In Chapter 7, the 
concepts of Bayes factor and decision theoretic approach are given and explained with the help of some 
standard examples. 

The fundamental problem of inference is to pass from a set of observations to express quantified 
opinion about an as yet unobserved set. In Chapter 8, Predictive Distributions for standard problems are 
obtained and applied to solve inferential problems concerning finite populations, reliability theory and 
inventory control. 

The general linear model includes regression models as a special case. Chapter 9 gives a Bayesian 
inference for the regression parameter and prediction of future unobserved values in decision theoretic 
framework. Bayesian analysis of simple control problems and Poisson regression superpopulation models 
are discussed as prediction problems. 

Chapter 10 includes useful results concerning large sample approximations of posterior distribution 
and posterior moments. Finally, the last chapter discusses some further topics like robustness of Bayesian 
inference, Bayesian approach to change point and outlier problems, and empirical and hierarchical Bayes 
estimation procedures. 

The book also contains a glossary of Bayesian terms, about 550 problems with solutions to about 
300 problems, and a number of remarks to supplement the main results. References are also given at the 
end of the book. 

We have not covered topics like Bayesian computations, subjective probability and utility theory 
to keep the book within reasonable bounds. Interested readers may refer to the book by Chen, Shao and 
Ibrahim (2000) for Monte Carlo methods; Leonard and Hsu (1999), Berger (1985) and DeGroot (1970) for 
utility theory; and O’Hagan and Forster (2004), Barnett (1982), and DeGroot (1970) to learn subjective 
probability. 

It is impossible to list all my debts. I am indebted to my teachers, in particular Ram Ballabh, A.R. Roy 
and S.R. Srivastava at the University of Lucknow, and I.R. Savage and R.G. Cornell at the Florida State
University for introducing me to the statistical methodology and Bayesian approach to statistics. I am
thankful to Ram Karan for encouraging and coaxing me to write this book for students of Indian universities 
as well as to numerous Bayesians whose books, expository articles and research papers helped me 
understand Bayesian concepts, principles, methodology and philosophy. In particular, I must express my 
gratitude to my latent gurus M.H. DeGroot, G.E.P. Box, James Berger, Arnold Zellner, D.V. Lindley and H. 
Jeffreys for providing illustrative examples and illuminating discussions. My debt is evident in the extent 
to which I have referred to their work. I am thankful to my students who always asked challenging 
questions inside and outside the classrooms. Priyanka Aggarwal deserves special thanks for undertaking 
the responsibility of preparing the manuscript, working out the examples and keeping me on my toes 
throughout this project. 

And last, but far from least, I am thankful to my ever patient and loving wife Sudakshina for her 
understanding and unconditional support. I have no words to thank my parents, Ram Ballabh and
Pushpawati who gave me their best during my formative years. I cannot forget Bhavtosh, Moyna and
Paritosh for extending emotional support and encouragement towards the completion of this book. 

I am also grateful to the University of Delhi for giving me opportunity to teach ever-attentive bright
students and providing facilities to do research over the last three decades. Acknowledgements are due 
to Mr. Yusuf for preparing the soft copy of the manuscript. And I acknowledge with thanks my indebtedness 
to N.K. Mehra of Narosa for his co-operation in this endeavour. 

The readers may find some misprints, errors, and ambiguity of presentation. I shall be grateful to 
any reader who brings these to my attention. Any errors and omissions are unintentional.


Ashok K. Bansal 
University of Delhi 


Delhi 
November 2006 
Chapter 1


Probability, Random Variables and their 
Probability Distributions 


In this chapter, we state some basic results from set theory, counting techniques, theory of probability, 
probability functions of random variables and their properties, and conditional expectations. These 
results are required quite frequently in the book. The discussion in the following sections is not meant 
to be self-contained. The results from calculus and linear algebra are provided in later chapters as and 
when they are to be used. 


1.1. ELEMENTS OF SET THEORY APPLIED TO EVENTS 


We shall denote events by letters A, B, C, ... or sometimes by $A_1, A_2, A_3, \ldots$ The contrary to
event A, that is, the event that A does not happen, will be represented by $A'$, which may be read as ‘not
A’. The events $A$ and $A'$ together make up the whole of the space $\mathcal{C}$; $A'$ is often called the complement
of A. If A and B are events within the space $\mathcal{C}$, A is said to imply B if whenever A occurs, B
necessarily occurs, and we denote it by $A \subset B$. If $A \subset B$ and $B \subset A$, the events A and B are equivalent
and we denote it by A = B. The event ‘both A and B’ is called the intersection of A and B and is
represented by $A \cap B$. The event ‘A and/or B’ is called the union of A and B and is denoted by $A \cup B$.
If the events A and B are such that they cannot both happen at the same time, they are said to be
disjoint or mutually exclusive. In general, the intersection of the events $A_1, A_2, \ldots, A_n$ is denoted by
$\bigcap_{k=1}^{n} A_k$, while their union is denoted by $\bigcup_{k=1}^{n} A_k$. The event $A \cap A'$ may be interpreted as an impossible
event and we denote it by $\emptyset$. The set of points representing $\emptyset$ is said to be a null set.

The operations of union and intersection are commutative and associative and the two 
operations are distributive as well, that is, 


$$A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$$
and
$$A \cup (B \cap C) = (A \cup B) \cap (A \cup C).$$
The difference of sets A and B, denoted by $A - B$, is $A \cap B'$.


De Morgan’s Laws 


$$(A \cap B)' = A' \cup B'$$
and
$$(A \cup B)' = A' \cap B'.$$
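
As an illustration, the identities above can be checked on a small finite space. The following Python sketch assumes a hypothetical space of six points (the faces of a die) and three arbitrarily chosen events A, B and C; the names `space` and `complement` are introduced here for illustration only. It verifies the identities for these particular sets and is not a proof.

```python
# Hypothetical finite space: the six faces of a die (an assumption for illustration).
space = set(range(1, 7))
A = {1, 2, 3}
B = {2, 4, 6}
C = {3, 4, 5}


def complement(event, universe=space):
    """Return the complement of `event` relative to `universe`."""
    return universe - event


# Distributive laws
dist_1 = (A & (B | C)) == ((A & B) | (A & C))
dist_2 = (A | (B & C)) == ((A | B) & (A | C))

# Difference of sets: A - B equals A intersected with the complement of B
diff_ok = (A - B) == (A & complement(B))

# De Morgan's laws
de_morgan_1 = complement(A & B) == (complement(A) | complement(B))
de_morgan_2 = complement(A | B) == (complement(A) & complement(B))

print(dist_1, dist_2, diff_ok, de_morgan_1, de_morgan_2)
```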

Definition 1.1. A non-empty class of subsets of $\mathcal{C}$ which is closed under the formation of countable
unions and complements and contains the null set $\emptyset$, is known as a σ-field (or a σ-algebra).
Definition 1.2. Borel field is the σ-field generated by the class of all bounded semi-closed intervals
of the form (a, b] and is denoted by $\mathcal{B}$. The elements of $\mathcal{B}$ are called Borel sets.

Remark 1.1. Every countable set of real numbers is a Borel set. 

Remark 1.2. Every Borel set of real numbers can be obtained by a countable number of operations 
of unions, differences, and intersections performed on intervals. 


1.2. COUNTING TECHNIQUES 


Definition 1.3. (Multiplication Principle) If the first operation can be performed in any one of n
ways and the second operation can be performed in any one of m ways, then both operations can be
performed in nm ways.
Definition 1.4. An arrangement of n symbols in a definite order is called a permutation of n symbols.
Definition 1.5. A Cartesian product $A \times B$ of A and B is the set of all possible 2-tuples (ordered pairs)
$(x_1, x_2)$ where $x_1 \in A$ and $x_2 \in B$, that is, $A \times B = \{(x_1, x_2) : x_1 \in A,\ x_2 \in B\}$.
Definition 1.6. An r-tuple is an ordered array of r components written $(x_1, x_2, \ldots, x_r)$.

The number of r-tuples ($r \le n$), using n different symbols (each only once), is called the number
of permutations of n things r at a time and is denoted by ${}^{n}P_{r}$.

Result 1.1. ${}^{n}P_{r} = \dfrac{n!}{(n-r)!}$.

Definition 1.7. The number of distinct subsets, each of size r, that can be constructed from a set with
n elements is called the number of combinations of n things r at a time and is denoted by
$\binom{n}{r}$ or ${}^{n}C_{r}$.

Result 1.2. $\binom{n}{r} = \dfrac{n!}{r!\,(n-r)!} = \dfrac{\Gamma(n+1)}{\Gamma(r+1)\,\Gamma(n-r+1)}$.

Remark 1.3. $\binom{n}{r}$ is also known as the Binomial Coefficient.

Definition 1.8. Suppose we have n objects, $n_1$ of which are alike, $n_2$ of which are alike but different
from the others, and so on, out to $n_k$ which are of the kth kind. The number of distinguishable ways in
which these n objects can be permuted among themselves is $n! \big/ \prod_{i=1}^{k} n_i!$ and is
denoted by $\binom{n}{n_1, n_2, \ldots, n_k}$. This quantity is frequently called the multinomial coefficient.
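
The counting formulas above are easy to evaluate numerically. The sketch below uses only Python's standard `math` module; the values n = 5, r = 2 and the group sizes (2, 2, 1) are arbitrary choices made for illustration, and the helper `multinomial` is a name introduced here, not one used in the text.

```python
import math

n, r = 5, 2  # arbitrary illustrative values

# Result 1.1: number of permutations of n things taken r at a time
n_P_r = math.factorial(n) // math.factorial(n - r)

# Result 1.2: binomial coefficient via factorials and via gamma functions
n_C_r = math.factorial(n) // (math.factorial(r) * math.factorial(n - r))
n_C_r_gamma = math.gamma(n + 1) / (math.gamma(r + 1) * math.gamma(n - r + 1))


def multinomial(*counts):
    """Multinomial coefficient n! / (n_1! n_2! ... n_k!), where n = n_1 + ... + n_k."""
    result = math.factorial(sum(counts))
    for c in counts:
        result //= math.factorial(c)
    return result


print(n_P_r, n_C_r, n_C_r_gamma, multinomial(2, 2, 1))
```

With k = 2 groups the multinomial coefficient reduces to the binomial coefficient, which gives a quick consistency check between Definition 1.8 and Result 1.2.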

Consider the experiment consisting of placing r distinguishable balls in n cells. We must choose
one cell for each ball. The sample space consists of $n^r$ r-tuples $(i_1, i_2, \ldots, i_r)$, where $i_j$ is the cell number
of the jth ball (j = 1, 2, ..., r), $1 \le i_j \le n$. Suppose we are given n elements $A_1, A_2, \ldots, A_n$ and we are
selecting r elements one by one. Then there are two possibilities:

1) Sampling with replacement - In such a sampling scheme, we identify the element selected,
note down its subscript and replace the element. This process is repeated r times. We shall have
$n^r$ samples of size r. In this case, repetitions are permitted and we can draw samples of
arbitrary size.

2) Sampling without replacement - In this case, no repetitions are allowed and therefore an
element once chosen is not replaced. Obviously, the sample size cannot exceed n and there are
$n(n-1)\cdots(n-r+1)$ possible samples of size r.
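
The two sampling schemes can be enumerated directly for small n and r. The sketch below assumes n = 4 and r = 2 purely for illustration and uses `itertools.product` and `itertools.permutations` from the Python standard library to list the ordered samples.

```python
from itertools import permutations, product

n, r = 4, 2  # arbitrary illustrative values
cells = range(1, n + 1)

# With replacement: every r-tuple (i_1, ..., i_r) with 1 <= i_j <= n
with_replacement = list(product(cells, repeat=r))

# Without replacement: ordered r-tuples with no repeated element
without_replacement = list(permutations(cells, r))

print(len(with_replacement))     # equals n**r
print(len(without_replacement))  # equals n(n-1)...(n-r+1)
```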


1.3 PROBABILITY 


Definition 1.9. A random experiment is an experiment in which 

(i) all outcomes of an experiment are known in advance, 

(ii) any performance of the experiment results in an outcome that is not known in advance, and
(iii) the experiment can be repeated under identical conditions. 

Definition 1.10. A sample space for an experiment is a set of all possible distinct outcomes that 
might be observed. 

Definition 1.11. An event A is a subset of a sample space $\mathcal{C}$, $A \subset \mathcal{C}$. An event is said to have occurred
if any one of its elements is the outcome observed. 

Definition 1.12. Two events that cannot occur simultaneously are called mutually exclusive events. 
Remark 1.4. The corresponding subsets are called disjoint sets. 

Definition 1.13. An assignment of probability is said to be equally likely if each elementary event
in the sample space $\mathcal{C}$ is assigned the same probability. Thus, if $\mathcal{C}$ contains n points $x_i$, $P(x_i) = 1/n$,
i = 1, 2, ..., n.

Remark 1.5. With this assignment, the probability of an event A is

$$P(A) = \frac{\text{number of elementary events in } A}{\text{total number of elementary events in } \mathcal{C}}.$$

Venn (1866) formalized the idea of expressing probability in terms of the limiting values of relative
frequencies in indefinitely long sequences of repeated and identical situations.
Definition 1.14. (Classical definition of probability) If a random experiment can lead to n mutually
exclusive and equally likely outcomes and if $n(A)$ is the number of outcomes that have an attribute A,
then

$$P(A) = \frac{n(A)}{n}.$$

Remark 1.6. The ratio $n(A)/n$ is also called the relative frequency of the event A in n independent trials
performed and is denoted by $R_n(A)$. Thus, $R_n(A)$ may be considered as an approximation to the “true”
probability function and P(A) may be defined as $\lim_{n \to \infty} R_n(A)$. This is the frequency view of probability.
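
The frequency view can be illustrated by simulation. The sketch below assumes a fair six-sided die and the event "an even face shows"; the function name `relative_frequency` and the fixed seed are choices made here for reproducibility and are not part of the text. The printed values depend on the pseudo-random generator, so only their general tendency towards the classical value 1/2 should be read from them.

```python
import random


def relative_frequency(event, n_trials, rng):
    """R_n(A): fraction of n_trials simulated die rolls whose outcome lies in `event`."""
    hits = sum(1 for _ in range(n_trials) if rng.randint(1, 6) in event)
    return hits / n_trials


rng = random.Random(0)   # fixed seed, chosen only so that runs are repeatable
A = {2, 4, 6}            # the event "an even face shows"
classical = len(A) / 6   # n(A)/n under equally likely outcomes

for n_trials in (100, 10_000, 100_000):
    print(n_trials, relative_frequency(A, n_trials, rng), classical)
```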

Definition 1.15. Let $\mathcal{C}$ be the sample space of a random experiment and $\mathcal{A}$ be the σ-field associated
with it. A real valued non-negative set function P, defined for each event in $\mathcal{A}$, is called a probability
if it satisfies the following properties:

(i) $P(\mathcal{C}) = 1$

(ii) If $A_1, A_2, \ldots$ is a sequence of disjoint events then

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i). \tag{1.1}$$


Remark 1.7. Property (ii) is called countable additivity. 
Remark 1.8. If $\emptyset$ is a null set, then $P(\emptyset) = 0$.
Remark 1.9. P is also a finitely additive set function.

Result 1.3. If $A_1, A_2 \in \mathcal{A}$ such that $A_1 \subset A_2$, then $P(A_1) \le P(A_2)$ and $P(A_2 - A_1) = P(A_2) - P(A_1)$,
since $A_2 - A_1 = A_2 \cap A_1'$.

Result 1.4. For all $A \in \mathcal{A}$, $P(A) \in [0, 1]$.

Remark 1.10. If P(A) = 0 for some $A \in \mathcal{A}$, we call A an event with zero probability or a null event.
However, it does not follow that $A = \emptyset$ (null set). Similarly, if P(B) = 1 for some $B \in \mathcal{A}$, we call
B a sure event but it does not follow that $B = \mathcal{C}$.

Result 1.5. If $A_1, A_2 \in \mathcal{A}$ then $P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2)$.
Remark 1.11. If $A_1, A_2, \ldots, A_n$ are an arbitrary number of events then

$$P\left(\bigcup_{i=1}^{n} A_i\right) \le \sum_{i=1}^{n} P(A_i). \tag{1.2}$$

This property is known as subadditivity.
Remark 1.12. $P(A') = 1 - P(A)$.
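
These properties can be checked exactly on a finite sample space with equally likely points. The sketch below assumes a hypothetical six-point space and three arbitrarily chosen events; `space` and `P` are names introduced for illustration, and exact rational arithmetic is used so that the comparisons are not affected by rounding.

```python
from fractions import Fraction

space = frozenset(range(1, 7))  # hypothetical sample space: one roll of a die


def P(event):
    """Equally likely assignment: P(A) = (points in A) / (points in the space)."""
    return Fraction(len(event & space), len(space))


A1, A2, A3 = {1, 2}, {2, 3, 4}, {4, 5}

# Result 1.5: P(A1 u A2) = P(A1) + P(A2) - P(A1 n A2)
lhs = P(A1 | A2)
rhs = P(A1) + P(A2) - P(A1 & A2)

# Remark 1.11 (subadditivity): P(A1 u A2 u A3) <= P(A1) + P(A2) + P(A3)
union_prob = P(A1 | A2 | A3)
sum_prob = P(A1) + P(A2) + P(A3)

# Remark 1.12: P(A') = 1 - P(A)
complement_prob = P(space - A1)

print(lhs, rhs, union_prob, sum_prob, complement_prob)
```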


1.4 CONDITIONAL PROBABILITY 


Consider the probability set function P(A) defined on the sample space $\mathcal{C}$. Suppose $A_1 \subset \mathcal{C}$, such
that $P(A_1) > 0$. Let us consider the outcomes of the random experiment which are elements of $A_1$. Let
$A_2$ be another subset of $\mathcal{C}$. We are interested in defining the probability of event $A_2$ relative to the
new sample space $A_1$. Let us denote the conditional probability of $A_2$, given $A_1$, by the symbol
$P(A_2|A_1)$. Since $A_1$ is the sample space, we should be able to define the symbol $P(A_2|A_1)$ in such a
way that
(i) $P(A_1|A_1) = 1$, and
(ii) $P(A_2|A_1) = P(A_1 \cap A_2|A_1)$, since our interest is in only those elements of $A_2$ which are also
elements of $A_1$.

However, in order to have a meaningful definition of conditional probability, we must make sure
that the ratio of the probabilities of the events $A_1 \cap A_2$ and $A_1$ relative to the space $A_1$ be the same
as the ratio of the probabilities of these events relative to the original space $\mathcal{C}$. Mathematically, we
should have

$$\frac{P(A_1 \cap A_2|A_1)}{P(A_1|A_1)} = \frac{P(A_1 \cap A_2)}{P(A_1)}.$$

On using (i) and (ii), we have

$$P(A_2|A_1) = \frac{P(A_1 \cap A_2)}{P(A_1)}.$$

Definition 1.16. The conditional probability of the event $A_2$, given the event $A_1$, provided
$P(A_1) > 0$, is

$$P(A_2|A_1) = \frac{P(A_1 \cap A_2)}{P(A_1)}. \tag{1.3}$$


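It is clear that the conditional probability should have the following properties:

(i) $P(A_2|A_1) \ge 0$,
(ii) $P(A_1|A_1) = 1$, and
(iii) $P\left(\bigcup_{j=2}^{\infty} A_j \,\Big|\, A_1\right) = \sum_{j=2}^{\infty} P(A_j|A_1)$, (1.4)

provided $A_2, A_3, \ldots$ are mutually exclusive events.
Properties (i) and (ii) are trivial. In order to show property (iii), consider

$$P\left(\bigcup_{j=2}^{\infty} A_j \,\Big|\, A_1\right) = \frac{P\left(\left(\bigcup_{j=2}^{\infty} A_j\right) \cap A_1\right)}{P(A_1)} = \frac{\sum_{j=2}^{\infty} P(A_j \cap A_1)}{P(A_1)} = \sum_{j=2}^{\infty} P(A_j|A_1).$$

Remark 1.13. It is also true that for any two events $A_1$ and $A_2$ with $P(A_1) > 0$, $P(A_2) > 0$,
$$P(A_1 \cap A_2) = P(A_1)\,P(A_2|A_1)$$
and
$$P(A_1 \cap A_2) = P(A_2)\,P(A_1|A_2).$$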


This relation is frequently called the multiplication rule for probabilities. 


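Remark 1.14. In general, if $A_1, A_2, \ldots, A_n$ are subsets of $\mathcal{C}$, $n \ge 2$, such that $P\left(\bigcap_{j=1}^{n-1} A_j\right) > 0$, then
$P(A_1) > 0$, $P(A_1 \cap A_2) > 0$, ..., $P\left(\bigcap_{j=1}^{n-2} A_j\right) > 0$. Thus $P\left(A_k \,\Big|\, \bigcap_{j=1}^{k-1} A_j\right)$ are well defined for
k = 2, 3, ..., n.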
Result 1.6. Multiplication rule can be extended to three or more events. In particular, for events
$A_1, A_2, A_3 \in \mathcal{A}$, we have

$$P(A_1 \cap A_2 \cap A_3) = P((A_1 \cap A_2) \cap A_3) = P(A_1 \cap A_2)\,P(A_3|A_1 \cap A_2) = P(A_1)\,P(A_2|A_1)\,P(A_3|A_1 \cap A_2).$$

Using mathematical induction, we have for any k events $A_1, A_2, \ldots, A_k$

$$P\left(\bigcap_{i=1}^{k} A_i\right) = P(A_1)\,P(A_2|A_1)\,P(A_3|A_1 \cap A_2) \cdots P\left(A_k \,\Big|\, \bigcap_{i=1}^{k-1} A_i\right). \tag{1.5}$$

Remark 1.15. Sir Harold Jeffreys in his book ‘Scientific Inference’, introduced the symbol vertical
stroke in the notation of conditional probability to mean ‘given’ or ‘assuming’ and this notation is
now standard in statistical literature.


Total Probability


Result 1.7. Suppose that $A_1, A_2, \ldots$ is a countable collection of mutually exclusive and exhaustive
events in sample space $\mathcal{C}$. If $P(A_j) > 0$ for all j, then

$$P(B) = \sum_{j=1}^{\infty} P(A_j)\,P(B|A_j) \quad \text{for all } B \text{ in } \mathcal{A}. \tag{1.6}$$

The proof is straightforward once we recall $B = \bigcup_{j=1}^{\infty} (B \cap A_j)$.
Remark 1.16. This result is known as the rule of total probability. In Bayesian literature Lindley, 
following Tribus, calls it extension rule. This is useful to tackle the problem of nuisance parameters. 
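
A small numerical illustration of the multiplication rule and the rule of total probability (1.6) follows. The partition, its probabilities and the conditional probabilities are hypothetical values chosen for this sketch (one may think of choosing one of three urns and then drawing a red ball); none of them come from the text.

```python
from fractions import Fraction

# Hypothetical partition A_1, A_2, A_3 (say, which of three urns is chosen)
prior = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]    # P(A_j)

# Hypothetical conditional probabilities of B (say, drawing a red ball)
cond_B = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]   # P(B | A_j)

# Multiplication rule: P(A_j n B) = P(A_j) P(B | A_j)
joint = [p * c for p, c in zip(prior, cond_B)]

# Rule of total probability (1.6): P(B) = sum over j of P(A_j) P(B | A_j)
P_B = sum(joint)

print(P_B)
```

The list `joint` holds the probabilities of the disjoint pieces $B \cap A_j$ whose union is B, which is exactly the decomposition used in the proof of Result 1.7.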
Definition 1.17. Events A and B are considered (statistically) independent when the occurrence or
non-occurrence of event B provides no information about whether event A has also occurred.
Mathematically, we shall call events A and B independent, if
$$P(A) = P(A|B) = P(A|B').$$

Since $P(A) = P(A|B) = \dfrac{P(A \cap B)}{P(B)}$, we have

$$P(A \cap B) = P(A)\,P(B).$$

Remark 1.17. If A and B are independent events then

$$P(A|B) = P(A)$$
and
$$P(B|A) = P(B).$$
Furthermore, $A$ and $B'$, $A'$ and $B$, as well as $A'$ and $B'$ are also pairs of independent events.
Remark 1.18. If $P(A) \ne 0$ and $P(B) \ne 0$, then events A and B cannot be simultaneously mutually
exclusive and independent.
Remark 1.19. Events A, B, and C are independent if, and only if,

(i) $P(A \cap B) = P(A)\,P(B)$,
(ii) $P(A \cap C) = P(A)\,P(C)$,
(iii) $P(B \cap C) = P(B)\,P(C)$, and
(iv) $P(A \cap B \cap C) = P(A)\,P(B)\,P(C)$.

If the first three conditions hold then we say that events A, B and C are pairwise independent events.
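
That pairwise independence does not imply condition (iv) can be seen from a standard textbook example, sketched below under the assumption of two tosses of a fair coin. The events A, B and C and the helper `P` are defined here for illustration only.

```python
from fractions import Fraction
from itertools import product

space = set(product("HT", repeat=2))  # two tosses of a fair coin, four equally likely points


def P(event):
    """Equally likely assignment on the four-point space."""
    return Fraction(len(event), len(space))


A = {w for w in space if w[0] == "H"}    # first toss is a head
B = {w for w in space if w[1] == "H"}    # second toss is a head
C = {w for w in space if w[0] == w[1]}   # both tosses show the same face

# Conditions (i)-(iii) of Remark 1.19
pairwise = (
    P(A & B) == P(A) * P(B)
    and P(A & C) == P(A) * P(C)
    and P(B & C) == P(B) * P(C)
)

# Condition (iv) in addition to (i)-(iii)
mutual = pairwise and P(A & B & C) == P(A) * P(B) * P(C)

print(pairwise, mutual)
```

Here every pair of events is independent, yet $P(A \cap B \cap C) = 1/4$ while $P(A)P(B)P(C) = 1/8$, so the three events are not independent.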


Remark 1.20. It is important to realize that statistical independence of two events is a property of 
the probability function P rather than that of the events. 


Renyi’s axiomatic definition of conditional probability 


Definition 1.18. Let $\mathcal{C}$ be the space of elementary events and $\mathcal{A}$ be the σ-algebra of subsets of $\mathcal{C}$. Its
elements A, B, ... are called events. Suppose $\mathcal{B}$ is a non-empty collection of events such that
$\mathcal{B} \subset \mathcal{A}$. According to Renyi (1970, p.70), the conditional probability of A given B, denoted by $P(A|B)$,
for $A \in \mathcal{A}$ and $B \in \mathcal{B}$, should satisfy the following axioms:

(i) For any two events A, B: $P(A|B) \ge 0$, and $P(B|B) = 1$.

(ii) For mutually exclusive events $A_1, A_2, \ldots$ and some event B, we have

$$P\left(\bigcup_{i=1}^{\infty} A_i \,\Big|\, B\right) = \sum_{i=1}^{\infty} P(A_i|B).$$

(iii) For every triplet (A, B, C) such that $B \subset C$, $P(B|C) > 0$, we have

$$P(A|B) = \frac{P(A \cap B|C)}{P(B|C)}.$$

Remark 1.21. The axiom of countable additivity included in Renyi’s axiom system implies that the
probability function is continuous. However, some events may not be assigned a probability, in
contrast to the finite additivity axiom system in which every event can be assigned a probability. In
a countable additivity system, at every point of discontinuity, $x_0$, of some cdf F(x), we have
$\lim_{x \to x_0+} F(x) = F(x_0)$, that is, the function F(x) is continuous from the right. Thus with the countable additivity axiom,
asymptotic theory for situations involving conditional probability may also be developed.

Remark 1.22. Renyi’s axiom system of probability accommodates probabilities on the entire real line 
and it is therefore closest to the system of probability required for Bayesian inference. As a 
consequence, mathematical formulation required to handle probability calculations involving improper 
distributions may be developed in a rigorous manner. 


1.5 RANDOM VARIABLES AND THEIR PROBABILITY FUNCTIONS


Definition 1.19. Let $\mathcal{C}$ denote the set of every possible distinct outcome of a random experiment (that
is, $\mathcal{C}$ is the sample space). A real valued function X which assigns to each element $c \in \mathcal{C}$ one and only
one real number X(c) = x is called a random variable. The space of X is the set of real numbers
$\mathcal{A} = \{x : x = X(c),\ c \in \mathcal{C}\}$.
Remark 1.23. If the set $\mathcal{C}$ is a set of real numbers, we may write X(c) = c, so that $\mathcal{A} = \mathcal{C}$.

For a subset A, $A \subset \mathcal{A}$, let C be the subset of $\mathcal{C}$ such that $C = \{c : c \in \mathcal{C} \text{ and } X(c) \in A\}$. Thus C is the
collection of all outcomes in $\mathcal{C}$ for which the random variable X has a value in A. Thus, we may define
a new probability function $P^*$ such that $P^*(X \in A) = P(C)$. It is interesting to note that the rv X is a
function which carries the probability from the sample space $\mathcal{C}$ to a space of real numbers $\mathcal{A}$. The
probability set function $P^*(X \in A)$ satisfies the conditions of the definition of the probability set
function.

Remark 1.24. $P^*(A)$ and $P(C)$ are both probability set functions. $P^*$ is often called an induced
probability as it is derived from the probability set function P. The probability set functions $P^*$ and
P are, in general, not the same set functions.


Given a random experiment with a sample space $\mathcal{C}$, consider two random variables $X_1$ and $X_2$,
which assign to each element c of $\mathcal{C}$ one and only one ordered pair of numbers $X_1(c) = x_1$, $X_2(c) = x_2$.
The space of $X_1$ and $X_2$ is the set of ordered pairs, $\mathcal{A} = \{(x_1, x_2) : x_1 = X_1(c),\ x_2 = X_2(c),\ c \in \mathcal{C}\}$.

Let X denote a random variable with one-dimensional space $\mathcal{A}$. Suppose the space $\mathcal{A}$ is a set
of points such that there is at most a finite number of points of $\mathcal{A}$ in every finite interval. Such a set
$\mathcal{A}$ is called a set of discrete points.


Definition 1.20. The collection of numbers $\{p_i\}$ satisfying $P(X = x_i) = p_i \ge 0$ for all i and $\sum_{i=1}^{\infty} p_i = 1$
is called the probability mass function (pmf) of random variable X.
Remark 1.25. A random variable X is said to be degenerate at c if P(X = c) = 1.


Let a function f(x) be such that f(x) > 0, $x \in \mathcal{A}$, and that $\sum_{\mathcal{A}} f(x) = 1$. Whenever a probability set
function P(A), $A \subset \mathcal{A}$, can be expressed in terms of such an f(x) by $P(A) = P(X \in A) = \sum_{A} f(x)$, then
X is called a random variable of the discrete type and the random variable X is said to have a
distribution of the discrete type. The function f(x) is called probability mass function (pmf).


Let the one dimensional set $\mathcal{A}$ be such that the integral $\int_{\mathcal{A}} f(x)\,dx = 1$, where
(i) f(x) > 0, $x \in \mathcal{A}$, and
(ii) f(x) has at most a finite number of discontinuities in every finite interval that is a subset of the
space of the random variable X.
If the probability set function P(A), $A \subset \mathcal{A}$, can be expressed in terms of f(x) by
$P(A) = P(X \in A) = \int_{A} f(x)\,dx$, then X is said to be a random variable of the continuous type and to
have a distribution of continuous type. In this case f(x) is called probability density function (pdf)
of the rv X.


For a continuous rv X, we have 
Remark 1.26. The set A = {x: x = a} has P(A) = 0. 


Remark 1.27. If the intervals (a, b), [a, b), (a, b], and [a, b] are subsets of $\mathcal{A}$, then
$$P(a < X < b) = P(a \le X < b) = P(a < X \le b) = P(a \le X \le b). \tag{1.7}$$

Remark 1.28. The distribution of X is not altered if the value of the pdf is changed at a single point.
More generally, if two pdfs of continuous type random variables differ only on a set having probability 
zero, the two corresponding set functions are exactly the same. However, the pmf of a discrete type 
rv may not be changed at any point of the sample space. 


Absolute Continuity 


Definition 1.21. A real valued function f defined on [a, b] is said to be absolutely continuous on
[a, b] if, for every $\varepsilon > 0$, there exists $\delta > 0$ such that $\sum_{k=1}^{n} |f(x_k') - f(x_k)| < \varepsilon$ for every finite collection
$\{(x_k, x_k')\}$ of disjoint subintervals with $\sum_{k=1}^{n} |x_k' - x_k| < \delta$.
Remark 1.29. An absolutely continuous function is continuous. 
Remark 1.30. Every absolutely continuous function is the indefinite integral of its derivative. 


Absolutely Continuous distribution 


Definition 1.22. A random variable X has an absolutely continuous distribution if there exists a
non-negative pdf f, such that, for any Borel set $B \subset \mathbb{R}$, $P(B) = \int_{B} f(x)\,dx$.


Distribution Function 


Definition 1.23. A real valued right continuous non-decreasing function F, defined on
$(-\infty, \infty)$, is called a distribution function if $F(-\infty) = 0$ and $F(+\infty) = 1$.

Remark 1.31. The set of discontinuity points of F is at most countable.

Definition 1.24. Let X be a continuous rv with distribution function F. If there exists a non-negative
function f(x) such that for every real number x, we have

$$F(x) = \int_{-\infty}^{x} f(t)\,dt, \tag{1.8}$$
then the function f is the probability density function (pdf) of rv X.
Remark 1.32. If F is absolutely continuous and f is continuous at x, we have

$$\frac{d}{dx} F(x) = f(x).$$
Remark 1.33. Let a and b be any two real numbers with a < b, then

$$P(a < X \le b) = F(b) - F(a) = \int_{a}^{b} f(t)\,dt. \tag{1.9}$$
Result 1.8. Let X be a continuous rv with pdf f. If y = g(x) is differentiable for all x and either
$g'(x) > 0$ for all x or $g'(x) < 0$ for all x, then Y = g(X) has a pdf

$$h(y) = \begin{cases} f\left(g^{-1}(y)\right) \left| \dfrac{d}{dy} g^{-1}(y) \right| & \text{if } y \in (a, b) \\ 0 & \text{otherwise,} \end{cases} \tag{1.10}$$

where $a = \min\left(g(-\infty), g(+\infty)\right)$ and $b = \max\left(g(-\infty), g(+\infty)\right)$.

Remark 1.34. Since $\dfrac{d}{dy} g^{-1}(y) = \left[\dfrac{d}{dx} g(x)\right]^{-1}$ evaluated at the point $x = g^{-1}(y)$, we may rewrite (1.10)
as

$$h(y) = f(x) \left| \frac{d}{dx} g(x) \right|^{-1} \Bigg|_{x = g^{-1}(y)}, \quad y \in (a, b).$$
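
As a numerical illustration of (1.10), the sketch below assumes that X is standard normal and takes the increasing transformation g(x) = exp(x), so that a = 0 and b = +∞. Both choices, and the helper names `f`, `g_inverse`, `h`, `integrate` and `Phi`, are assumptions made here for illustration. The probability P(1 < Y ≤ e) computed by integrating h is compared with P(0 < X ≤ 1) computed from the distribution function of X; the two should agree up to the error of the numerical integration.

```python
import math


def f(x):
    """pdf of X, assumed standard normal for this illustration."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def g_inverse(y):
    """Inverse of the assumed transformation g(x) = exp(x)."""
    return math.log(y)


def h(y):
    """pdf of Y = g(X) from (1.10): f(g^{-1}(y)) |d g^{-1}(y)/dy| for y in (0, inf)."""
    if y <= 0.0:
        return 0.0
    return f(g_inverse(y)) * abs(1.0 / y)


def integrate(func, lo, hi, steps=100_000):
    """Midpoint rule; adequate for a smooth integrand on a finite interval."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (k + 0.5) * width) for k in range(steps))


def Phi(x):
    """Distribution function of the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


prob_from_h = integrate(h, 1.0, math.e)   # P(1 < Y <= e) using the pdf of Y
prob_from_F = Phi(1.0) - Phi(0.0)         # P(0 < X <= 1) using the cdf of X
print(prob_from_h, prob_from_F)
```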


Definition 1.25. A real valued function F of two variables, defined by $F(x, y) = P(X \le x, Y \le y)$ for
all (x, y) in $\mathbb{R}^2$, is known as the distribution function of the random variables (X, Y).
Remark 1.35. F(x, y) is non-decreasing and continuous on the right with respect to each coordinate.
Remark 1.36. $F(+\infty, +\infty) = 1$, $F(x, -\infty) = 0$ for all x and $F(-\infty, y) = 0$ for all y.
Remark 1.37. For every $(x_1, y_1)$, $(x_2, y_2)$ with $x_1 < x_2$ and $y_1 < y_2$,

$$F(x_2, y_2) - F(x_2, y_1) + F(x_1, y_1) - F(x_1, y_2) \ge 0.$$
Remark 1.38. The definition of the distribution function F may be extended to the n-dimensional
rv $(X_1, X_2, \ldots, X_n)$.
Definition 1.26. Let (X, Y) be a two-dimensional rv of the discrete type, that takes on pairs of values
$(x_i, y_j)$, i = 1, 2, ...; j = 1, 2, ..., then the joint probability mass function of (X, Y) is

$$p_{ij} = P(X = x_i, Y = y_j), \quad i = 1, 2, \ldots;\ j = 1, 2, \ldots$$

Remark 1.39. $\sum_{i=1}^{\infty} \sum_{j=1}^{\infty} p_{ij} = 1$.

Definition 1.27. A two-dimensional rv (X, Y) is said to be of continuous type if there exists a
non-negative function f, such that, for every pair $(x, y) \in \mathbb{R}^2$,

$$F(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f(u, v)\,du\,dv, \tag{1.11}$$
where F is the distribution function of (X, Y). The function f is called the joint pdf of (X, Y).

Remark 1.40. $\dfrac{\partial^2 F(x, y)}{\partial x\,\partial y} = f(x, y)$.

Remark 1.41. Let us define $p_{i\cdot} = \sum_{j=1}^{\infty} p_{ij}$ and $p_{\cdot j} = \sum_{i=1}^{\infty} p_{ij}$, then the collection of numbers $\{p_{i\cdot}\}$ is
called the marginal pmf of X and the collection $\{p_{\cdot j}\}$, the marginal pmf of Y.

Remark 1.42. If (X, Y) is a continuous two-dimensional rv with pdf f then
$f_1(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$ and $f_2(y) = \int_{-\infty}^{\infty} f(x, y)\,dx$ are called the marginal pdf of X and marginal pdf of Y,
respectively.
Definition 1.28. Let (X, Y) be a two-dimensional discrete type rv. The function

$P(X = x_i \mid Y = y_j) = \dfrac{P(X = x_i, Y = y_j)}{P(Y = y_j)}$ for fixed j, (1.12)

is known as the conditional pmf of X, given $Y = y_j$, provided $P(Y = y_j) > 0$. Similarly, the conditional pmf of Y, given $X = x_i$, is given by $P(Y = y_j \mid X = x_i)$, provided $P(X = x_i) > 0$.

Definition 1.29. Let f be the pdf of the two-dimensional rv (X, Y) and let $f_2(y)$ be the marginal pdf of Y. At every point (x, y) at which f is continuous and $f_2(y) > 0$, the conditional pdf of X, given Y = y, exists and is given by

$f(x \mid y) = \dfrac{f(x, y)}{f_2(y)}.$ (1.13)

Similarly, the conditional pdf of Y, given X = x, is

$f(y \mid x) = \dfrac{f(x, y)}{f_1(x)},$ (1.14)

provided $f_1(x) > 0$, where $f_1(x) = \int_{-\infty}^{\infty} f(x, y) \, dy$ is the marginal pdf of X.

Remark 1.43. We may write $f(x \mid y) = f(y \mid x) f_1(x) / f_2(y)$. This result is sometimes called Bayes' theorem for random variables.


Definition 1.30. The conditional distribution $P(X \le x \mid X \in A)$, defined for any real x, where $A \subset \mathbb{R}$, is called the truncated distribution of X. If X is a discrete random variable with pmf $p_i = P(X = x_i)$, i = 1, 2, ..., the truncated distribution of X is given by

$P(X = x_i \mid X \in A) = \dfrac{P(X = x_i, X \in A)}{P(X \in A)} = \begin{cases} p_i \Big/ \sum_{x_j \in A} p_j & \text{if } x_i \in A, \\ 0 & \text{otherwise.} \end{cases}$ (1.15)

If X is a continuous random variable with pdf f, then

$P(X \le x \mid X \in A) = \int_{(-\infty, x] \cap A} f(y) \, dy \Big/ \int_A f(y) \, dy.$ (1.16)

The pdf of the truncated distribution is given by

$h(x) = \begin{cases} f(x) \Big/ \int_A f(y) \, dy & \text{if } x \in A, \\ 0 & \text{otherwise.} \end{cases}$ (1.17)

Remark 1.44. Truncation of a distribution is useful in cases where the distribution function F in question does not have a finite mean. It is also useful when the value of the random variable X is not observable at a particular point or a set of points. In particular, the pmf of a zero-truncated Poisson distribution is

$P(X = x \mid \theta) = \dfrac{e^{-\theta} \theta^x}{x! \, (1 - e^{-\theta})}, \quad x = 1, 2, \ldots$
Result 1.9. Let Y = aX + b and a > 0, then

$F_Y(y) = F_X\left( \dfrac{y - b}{a} \right),$ (1.18)

where $F_X$ and $F_Y$ are the distribution functions of X and Y, respectively. In particular, if X is a continuous rv with pdf $f_X$, we have

$f_Y(y) = \dfrac{1}{a} f_X\left( \dfrac{y - b}{a} \right).$ (1.19)


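Result 1.10. Let $(X_1, X_2)$ be a two-dimensional continuous rv and let $Y_1 = g_1(X_1, X_2)$ and $Y_2 = g_2(X_1, X_2)$, where $g_1$ and $g_2$ have continuous first partial derivatives at all points $(x_1, x_2)$ and are such that, at all points $(x_1, x_2)$, the Jacobian

$J = \dfrac{\partial(y_1, y_2)}{\partial(x_1, x_2)} = \begin{vmatrix} \dfrac{\partial y_1}{\partial x_1} & \dfrac{\partial y_1}{\partial x_2} \\ \dfrac{\partial y_2}{\partial x_1} & \dfrac{\partial y_2}{\partial x_2} \end{vmatrix} \ne 0;$ (1.20)

then the pdf of the transformed variables $(Y_1, Y_2)$ is

$f_{Y_1, Y_2}(y_1, y_2) = f_{X_1, X_2}(x_1, x_2) / |J|, \quad (y_1, y_2) \in C,$ (1.21)

where C is the set of points $y = (y_1, y_2)$ such that the two equations $y_1 = g_1(x_1, x_2)$, $y_2 = g_2(x_1, x_2)$ possess at least one solution $(x_1, x_2)$. Here $(x_1, x_2)$ denotes the unique solution $x_1 = g_1^{-1}(y_1, y_2)$, $x_2 = g_2^{-1}(y_1, y_2)$.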
Remark 1.45. It is often easier to use

$\dfrac{\partial(x_1, x_2)}{\partial(y_1, y_2)} = \left[ \dfrac{\partial(y_1, y_2)}{\partial(x_1, x_2)} \right]^{-1}$ (1.22)

when it is difficult to solve for $x_1, x_2$ in terms of $y_1, y_2$.
Remark 1.46. The above result may be extended to n-dimensional random variables as well.


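Result 1.11. Suppose $Y_1 = g_1(X_1, X_2, \ldots, X_n), \ldots, Y_m = g_m(X_1, \ldots, X_n)$, where m < n, and $(X_1, X_2, \ldots, X_n)$ is a continuous random vector. Let us consider $Y_{m+1}, \ldots, Y_n$, where $Y_j = g_j(X_1, X_2, \ldots, X_n)$, j = m + 1, m + 2, ..., n. Suppose $g_1, \ldots, g_n$ have continuous partial derivatives at all $(x_1, x_2, \ldots, x_n)$ and are such that, at all $(x_1, x_2, \ldots, x_n)$, the Jacobian

$J = \dfrac{\partial(y_1, y_2, \ldots, y_n)}{\partial(x_1, x_2, \ldots, x_n)} = \begin{vmatrix} \dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_1}{\partial x_n} \\ \vdots & & \vdots \\ \dfrac{\partial y_n}{\partial x_1} & \cdots & \dfrac{\partial y_n}{\partial x_n} \end{vmatrix} \ne 0.$ (1.23)

If the pdf of $(X_1, X_2, \ldots, X_n)$ is continuous at all but a finite number of points $(x_1, x_2, \ldots, x_n)$, then the random vector $(Y_1, Y_2, \ldots, Y_n)$ is continuous with pdf

$f_Y(y_1, y_2, \ldots, y_n) = \begin{cases} f_X(x_1, x_2, \ldots, x_n) / |J| & \text{if } y \in C, \\ 0 & \text{if } y \notin C, \end{cases}$ (1.24)

where C is the set of points $(y_1, y_2, \ldots, y_n)$ such that the n equations $y_i = g_i(x_1, x_2, \ldots, x_n)$, i = 1, 2, ..., n, possess at least one solution $(x_1, x_2, \ldots, x_n)$.
The marginal pdf of $(Y_1, Y_2, \ldots, Y_m)$ is the (n - m)-fold integral

$f(y_1, y_2, \ldots, y_m) = \int \cdots \int f_Y(y_1, \ldots, y_m, y_{m+1}, \ldots, y_n) \, dy_{m+1} \cdots dy_n.$ (1.25)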


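Result 1.12. If $(X_1, X_2)$ is continuous with joint pdf $f_{X_1, X_2}$, then
(a) For $Y = X_1 + X_2$

$f_{X_1 + X_2}(y) = \int_{-\infty}^{\infty} f_{X_1, X_2}(y - x_2, x_2) \, dx_2 = \int_{-\infty}^{\infty} f_{X_1, X_2}(x_1, y - x_1) \, dx_1$ (1.26)

(b) For $Y = X_1 - X_2$

$f_{X_1 - X_2}(y) = \int_{-\infty}^{\infty} f_{X_1, X_2}(x_1, x_1 - y) \, dx_1 = \int_{-\infty}^{\infty} f_{X_1, X_2}(y + x_2, x_2) \, dx_2$ (1.27)

(c) For $Y = X_1 X_2$

$f_{X_1 X_2}(y) = \int_{-\infty}^{\infty} \dfrac{1}{|x_1|} f_{X_1, X_2}\left( x_1, \dfrac{y}{x_1} \right) dx_1$ (1.28)

(d) For $Y = X_1 / X_2$

$f_{X_1 / X_2}(y) = \int_{-\infty}^{\infty} |x_2| \, f_{X_1, X_2}(y x_2, x_2) \, dx_2$, provided $P(X_2 = 0) = 0$. (1.29)

Remark 1.47. If $X_1$ and $X_2$ are independent,

$f_{X_1 + X_2}(y) = \int_{-\infty}^{\infty} f_{X_1}(y - x_2) f_{X_2}(x_2) \, dx_2.$ (1.30)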
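
As a quick numerical illustration of (1.30), the sketch below approximates the convolution integral by a midpoint rule for two independent U(0, 1) variables, whose sum is known to have the triangular density on (0, 2). The function names, integration limits, and grid size are illustrative choices, not part of the text.

```python
def uniform_pdf(x):
    """pdf of U(0, 1)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0


def sum_pdf(y, f1=uniform_pdf, f2=uniform_pdf, lo=-1.0, hi=2.0, steps=30000):
    """Midpoint-rule approximation of (1.30): integral of f1(y - x2) * f2(x2) dx2."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x2 = lo + (i + 0.5) * width
        total += f1(y - x2) * f2(x2)
    return total * width


def triangular_pdf(y):
    """Exact pdf of the sum of two independent U(0, 1) variables."""
    if 0.0 < y <= 1.0:
        return y
    if 1.0 < y < 2.0:
        return 2.0 - y
    return 0.0


for y in (0.5, 1.0, 1.5):
    print(y, sum_pdf(y), triangular_pdf(y))
```

Because the integrand is discontinuous, the approximation is only accurate to about the grid width; a finer grid or an adaptive rule would be needed for tighter agreement.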


Expectation of a Random Variable 


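Definition 1.31. For a rv X, we define its expectation as

$E(X) = \begin{cases} \int_{-\infty}^{\infty} x f(x) \, dx & \text{if X is absolutely continuous,} \\ \sum_{i} x_i P(X = x_i) & \text{if X is discrete.} \end{cases}$ (1.31)

Remark 1.48. E[g(X)] exists if, and only if, $\int_{-\infty}^{\infty} |g(x)| f(x) \, dx < \infty$.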


Remark 1.49. If c is a constant and g(X), $g_1(X)$, and $g_2(X)$ are functions whose expectations exist, then

(i) E(c) = c (1.32)
(ii) E(cg(X)) = cE(g(X)) (1.33)
(iii) $E(g_1(X) + g_2(X)) = E(g_1(X)) + E(g_2(X))$. (1.34)

Remark 1.50. If E(|g(X)|) = 0 then g(x) = 0 for all x with f(x) > 0.

Remark 1.51. If X is an indicator variable, that is, X = 1 if event A occurs and X = 0 otherwise, then E(X) = P(A).

Result 1.13. If the rth moment $E(X^r)$ of a rv X exists for some positive integer r, then $E(X^s)$ also exists for s = 1, 2, ..., r - 1.

Definition 1.32. The nth central moment of X is

$\mu_n = E(X - E(X))^n,$

provided this expectation exists.
In particular, the second central moment of X is known as the variance of X.
Remark 1.52. $Var(X) \ge 0$, and equality holds if, and only if, X is a degenerate random variable.
Remark 1.53. The positive square root of Var(X) is called the standard deviation of X.
Remark 1.54. $Var(aX + b) = a^2 Var(X)$, $a \ne 0$. (1.35)


Moment Generating Function 


Definition 1.33. The moment generating function (mgf) of a rv X is defined, for every real number t, as $M_X(t) = E(e^{tX})$, provided the expectation exists.

Remark 1.55. If the mgf of the rv X exists for |t| < T (for some T > 0), then $E(X^n)$ exists (n = 1, 2, 3, ...) and

$E(X^n) = \dfrac{d^n}{dt^n} M_X(t) \Big|_{t=0}.$ (1.36)

Remark 1.56. The mgf may not exist even if some moments of X exist.
Remark 1.57. $M_{aX+b}(t) = e^{bt} M_X(at)$. (1.37)
Remark 1.58. If two random variables have mgfs that exist and are equal then they have the same distribution function, and vice versa.


Conditional Expectation 


Definition 1.34. If F(y|x) is absolutely continuous or discrete, then the conditional mean of Y, given X = x, is

$E(Y \mid X = x) = \begin{cases} \int_{-\infty}^{\infty} y f(y \mid x) \, dy & \text{if absolutely continuous,} \\ \sum_{y} y P(Y = y \mid X = x) & \text{if discrete.} \end{cases}$ (1.38)

Result 1.14. Let g be a real valued function of the random variable X having pdf f(x). The conditional expectation of g(X), given a < X < b, is

$E(g(X) \mid a < X < b) = \int_a^b g(x) f(x \mid a < X < b) \, dx = \int_a^b g(x) f(x) \, dx \Big/ \int_a^b f(x) \, dx.$ (1.39)

If X is a discrete rv, then

$E(g(X) \mid a < X < b) = \sum_{x} g(x) P(X = x \mid a < X < b) = \sum_{\{x: a < x < b\}} g(x) P(X = x) \Big/ \sum_{\{x: a < x < b\}} P(X = x).$ (1.40)


Result 1.15. Suppose g is a real valued function of two random variables X and Y. The conditional expectation of g(X, Y), given Y = y, is defined by

$E(g(X, Y) \mid Y = y) = \begin{cases} \int_{-\infty}^{\infty} g(x, y) \dfrac{f(x, y)}{f(y)} \, dx & \text{if X and Y are continuous rvs,} \\ \sum_{x} g(x, y) \dfrac{P(X = x, Y = y)}{P(Y = y)} & \text{if X and Y are discrete rvs.} \end{cases}$ (1.41)

Remark 1.59. If X and Y are independent random variables then E(X|Y = y) = E(X).

Remark 1.60. The graph of E(X|Y = y) is called the regression curve of X on Y, and that of 
E(Y|X = x) is called the regression curve of Y on X. 

Remark 1.61. If g and h are real valued functions of (X,Y) then 

E(ag(X, Y) + bh(X, Y) | Y = y) = aE(g(X, Y) | Y = y) + bE(h(X, Y) | Y = y) (1.42)
where a, b are constants. 

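Result 1.16.

$E(X) = \begin{cases} \int_{-\infty}^{\infty} E(X \mid y) f(y) \, dy & \text{if Y is a continuous rv,} \\ \sum_{y} E(X \mid y) P(Y = y) & \text{if Y is a discrete rv.} \end{cases}$ (1.43)

Remark 1.62. In particular, if X is an indicator random variable such that X = 1 if event A occurs and X = 0 if A does not occur, then

$P(A) = \begin{cases} \int_{-\infty}^{\infty} P(A \mid Y = y) f(y) \, dy & \text{if Y is a continuous rv,} \\ \sum_{y} P(A \mid Y = y) P(Y = y) & \text{if Y is a discrete rv.} \end{cases}$ (1.44)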


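Result 1.17. E(E(X|Y, Z)|Y) = E(X|Y) (1.45)
Proof. Let X, Y, Z be continuous random variables with joint pdf f(x, y, z); then

$E(X \mid Y, Z) = \int_{-\infty}^{\infty} x f(x \mid y, z) \, dx = \int_{-\infty}^{\infty} x \dfrac{f(x, y, z)}{f(y, z)} \, dx = g(y, z)$, say,

and

$E(E(X \mid Y, Z) \mid Y) = E(g(Y, Z) \mid Y)$
$= \int_{-\infty}^{\infty} g(y, z) f(z \mid y) \, dz$
$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x \dfrac{f(x, y, z)}{f(y, z)} \cdot \dfrac{f(y, z)}{f(y)} \, dx \, dz$
$= \int_{-\infty}^{\infty} \dfrac{x}{f(y)} \left( \int_{-\infty}^{\infty} f(x, y, z) \, dz \right) dx$, on interchanging the order of integration,
$= \int_{-\infty}^{\infty} x \dfrac{f(x, y)}{f(y)} \, dx = \int_{-\infty}^{\infty} x f(x \mid y) \, dx$
$= E(X \mid Y).$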
Result 1.18. E(g(X)) = E(E(g(X)|Y)), provided E(g(X)) is finite. (1.46)
Proof. Suppose X and Y are discrete random variables; then

$E(E(g(X) \mid Y)) = \sum_{y} \left\{ \sum_{x} g(x) P(X = x \mid Y = y) \right\} P(Y = y)$
$= \sum_{y} \sum_{x} g(x) \dfrac{P(X = x, Y = y)}{P(Y = y)} P(Y = y)$
$= \sum_{x} g(x) \sum_{y} P(X = x, Y = y)$
$= \sum_{x} g(x) P(X = x) = E(g(X)).$

The result also holds for continuous random variables X and Y.
Remark 1.63. If E(Y) and E(g(X)) exist then E(g(X) | X = x) = g(x) and 
E(Yg(X)|X = x) = g(x) E(Y | X =x). (1.47) 
Result 1.19. If $E(Y^2) < \infty$, then Var(Y) = E(Var(Y|X)) + Var(E(Y|X)). (1.48)
Proof. $E(Var(Y|X)) = E[E(Y^2|X) - (E(Y|X))^2]$
$= E(E(Y^2|X)) - E((E(Y|X))^2)$
$= E(Y^2) - (E(E(Y|X)))^2 - Var(E(Y|X))$
$= E(Y^2) - (E(Y))^2 - Var(E(Y|X))$
$= Var(Y) - Var(E(Y|X)).$

Remark 1.64. If $E(Y^2) < \infty$, then $Var(Y) \ge Var(E(Y|X))$, (1.49)
with equality if, and only if, Y is a function of X.
Result 1.20. Cov(X, Y) = E(Cov(X, Y|Z)) + Cov(E(X|Z), E(Y|Z)) (1.50)

Proof. Cov(X, Y) = E(XY) - E(X)E(Y)
= E(E(XY|Z)) - E(E(X|Z)) E(E(Y|Z))
= E[Cov(X, Y|Z) + E(X|Z)E(Y|Z)] - E(E(X|Z)) E(E(Y|Z))
= E(Cov(X, Y|Z)) + E(E(X|Z)E(Y|Z)) - E(E(X|Z)) E(E(Y|Z))
= E(Cov(X, Y|Z)) + Cov(E(X|Z), E(Y|Z)).
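
The decomposition in (1.48) can be checked exactly on a small discrete example. The sketch below uses a hypothetical joint pmf (the probabilities are made up for illustration) and computes Var(Y), E(Var(Y|X)) and Var(E(Y|X)) directly from their definitions.

```python
# Hypothetical joint pmf P(X = x, Y = y); the values are illustrative only.
joint = {(0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
         (1, 0): 0.25, (1, 1): 0.05, (1, 2): 0.30}


def total_variance_parts(joint):
    """Return (Var(Y), E(Var(Y|X)), Var(E(Y|X))) for a finite joint pmf."""
    # Marginal pmf of X.
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    # Unconditional mean and variance of Y.
    ey = sum(y * p for (_, y), p in joint.items())
    ey2 = sum(y * y * p for (_, y), p in joint.items())
    var_y = ey2 - ey ** 2
    # Conditional mean and variance of Y for each value of X.
    cond_mean, cond_var = {}, {}
    for x, p_x in px.items():
        m1 = sum(y * p for (xx, y), p in joint.items() if xx == x) / p_x
        m2 = sum(y * y * p for (xx, y), p in joint.items() if xx == x) / p_x
        cond_mean[x], cond_var[x] = m1, m2 - m1 ** 2
    e_cond_var = sum(cond_var[x] * px[x] for x in px)
    var_cond_mean = sum((cond_mean[x] - ey) ** 2 * px[x] for x in px)
    return var_y, e_cond_var, var_cond_mean


var_y, within, between = total_variance_parts(joint)
print(var_y, within + between)
```

The two printed numbers should coincide up to rounding, and the "between" part never exceeds Var(Y), which is the content of Remark 1.64.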


Chapter 2 


Some Special Distributions 


The purpose of this chapter is to present some of the well-known univariate discrete and continuous distributions that commonly occur in Bayesian inference. Mixture distributions, discussed in Section 2.3, are important because they arise as prior predictive distributions, which also serve as the normalising constant in the posterior distribution. Section 2.4 deals with the multivariate normal, Wishart, multivariate-t, multinomial, and Dirichlet distributions and their useful properties. The exponential family of distributions and modified power series distributions are introduced in Section 2.5.


2.1 DISCRETE DISTRIBUTIONS 
Bernoulli Distribution 


A random variable X has a two-point distribution if it takes two values $x_1$ and $x_2$ with probabilities $P(X = x_1) = \theta$, $P(X = x_2) = 1 - \theta$, $0 < \theta < 1$. If $x_1 = 1$ and $x_2 = 0$, we get the important Bernoulli random variable having pmf

$f(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}, \quad x = 0, 1; \; 0 < \theta < 1,$ (2.1)

and we shall say X ~ Bernoulli($\theta$).
The moment generating function is

$M(t) = 1 + \theta(e^t - 1)$ for all real t. (2.2)

In particular, the mean and variance are
$E(X) = \theta$ and $Var(X) = \theta(1 - \theta)$.


Discrete Uniform Distribution 


A random variable X is said to have a discrete uniform distribution on n points $x_1, x_2, \ldots, x_n$ if it has the pmf

$P(X = x_i) = \dfrac{1}{n}; \quad i = 1, 2, \ldots, n.$ (2.3)

The mean and variance of X are

$E(X) = \dfrac{1}{n} \sum_{i=1}^{n} x_i$ and $Var(X) = \dfrac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$, where $\bar{x} = \dfrac{1}{n} \sum_{i=1}^{n} x_i$.


Binomial Distribution 


A random variable X is said to have a binomial distribution with parameters n and $\theta$ if it has the pmf

$f(x \mid n, \theta) = \dbinom{n}{x} \theta^x (1 - \theta)^{n - x}; \quad x = 0, 1, \ldots, n; \; 0 < \theta < 1,$ (2.4)

and we shall say X ~ Bin(n, $\theta$).
The moment generating function of X is

$M(t) = (1 + \theta(e^t - 1))^n$ for all real t. (2.5)

The mean and variance are given by

$E(X) = n\theta$ and $Var(X) = n\theta(1 - \theta)$.
Remark 2.1. The binomial distribution can also be considered as the distribution of the sum of n independent and identically distributed Bernoulli random variables with parameter $\theta$.
Remark 2.2. The distribution function can be expressed in terms of the incomplete beta function. The result is

$\sum_{k=0}^{x} \dbinom{n}{k} \theta^k (1 - \theta)^{n - k} = I_{1 - \theta}(n - x, x + 1) = 1 - I_{\theta}(x + 1, n - x),$ (2.6)

where $I_x(a, b) = \int_0^x \dfrac{t^{a-1} (1 - t)^{b-1}}{B(a, b)} \, dt$ is the incomplete beta function. (2.7)


Negative Binomial Distribution 


A random variable X is said to have a negative binomial distribution with parameters n and $\theta$ if it has the following pmf

$f(x \mid n, \theta) = \dbinom{n + x - 1}{x} \theta^n (1 - \theta)^x; \quad x = 0, 1, 2, \ldots; \; 0 < \theta < 1, \; n \ge 1,$ (2.8)

and we shall say X ~ NBin(n, $\theta$).
The moment generating function of X is

$M(t) = \left( \dfrac{\theta}{1 - (1 - \theta) e^t} \right)^n$ for $(1 - \theta) e^t < 1$. (2.9)

The mean and variance are

$E(X) = \dfrac{n(1 - \theta)}{\theta}$ and $Var(X) = \dfrac{n(1 - \theta)}{\theta^2}$.

Remark 2.3. When n = 1, the negative binomial distribution reduces to the geometric distribution with pmf

$f(x \mid \theta) = \theta (1 - \theta)^x; \quad x = 0, 1, 2, \ldots; \; 0 < \theta < 1,$ (2.10)

and we shall say X ~ Geo($\theta$).
Remark 2.4. If Y = X + n, then

$f(y \mid n, \theta) = \dbinom{y - 1}{n - 1} \theta^n (1 - \theta)^{y - n}; \quad y = n, n + 1, \ldots$ (2.11)

with $E(Y) = E(X) + n = n/\theta$ and $Var(Y) = Var(X) = n(1 - \theta)/\theta^2$.
The moment generating function of Y is

$M(t) = (\theta e^t)^n (1 - (1 - \theta) e^t)^{-n}$ for $(1 - \theta) e^t < 1$. (2.12)

Remark 2.5. The Pascal mass function is identical to the negative binomial with n replaced by n - r and $\theta$ by $(1 - \theta)$. Furthermore, the Pascal mass function is only defined for integral n and r, whereas the negative binomial mass function may be defined for non-integral n and r as well, provided n - r is integral.


Poisson Distribution 


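A random variable X is said to have a Poisson distribution with parameter $\theta$ if it has the pmf

$f(x \mid \theta) = \dfrac{e^{-\theta} \theta^x}{x!}; \quad x = 0, 1, \ldots; \; \theta > 0,$ (2.13)

and we shall say X ~ Pois($\theta$).
The moment generating function of X is

$M(t) = \exp(\theta(e^t - 1))$, for all real t. (2.14)

In particular, $E(X) = \theta$ and $Var(X) = \theta$.
Remark 2.6. If X ~ Pois($\theta$), then

$P(X \le k - 1) = \sum_{x=0}^{k-1} \dfrac{e^{-\theta} \theta^x}{x!} = \dfrac{1}{\Gamma(k)} \int_{\theta}^{\infty} z^{k-1} e^{-z} \, dz = 1 - I(\theta, k), \quad k = 1, 2, \ldots,$ (2.15)

where $I(\theta, k)$ is the incomplete gamma function.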


Hypergeometric Distribution 


A random variable X is said to have a hypergeometric distribution with parameters A, B, and n (where A, B, n are positive integers such that $n \le A + B$) if it has the pmf

$f(x \mid A, B, n) = \dfrac{\dbinom{A}{x} \dbinom{B}{n - x}}{\dbinom{A + B}{n}}; \quad x = 0, 1, \ldots, n,$ (2.16)

and we shall say X ~ Hypergeometric(A, B, n).

The pmf is non-zero only when x is an integer in the interval [max(0, n - B), min(n, A)]. The moment generating function is

$M(t) = \dfrac{(A + B - n)! \, B!}{(A + B)! \, (B - n)!} F(-n, -A; B - n + 1; e^t)$, for all real t, (2.17)

where $F(\cdot, \cdot; \cdot; \cdot)$ is the hypergeometric function.
The mean and variance of X are

$E(X) = \dfrac{nA}{A + B}$ and $Var(X) = \dfrac{nAB}{(A + B)^2} \left( \dfrac{A + B - n}{A + B - 1} \right).$

Remark 2.7. The reason for the name hypergeometric is that the quantities on the RHS of equation (2.16) are successive terms in the expansion of

$\dfrac{(A + B - n)! \, B!}{(A + B)! \, (B - n)!} F(-n, -A; B - n + 1; 1),$

where $F(\alpha, \beta; \gamma; z)$ is a hypergeometric function.
2.2. CONTINUOUS DISTRIBUTIONS 


Uniform Distribution 


A random variable X is said to have a uniform distribution if it has the pdf

$f(x \mid a, b) = \dfrac{1}{b - a}; \quad a < x < b,$ (2.18)

and we shall say X ~ U(a, b).

The mean of the U(a, b) distribution is (a + b)/2 and the variance is $(b - a)^2 / 12$. The moment generating function of X is

$M(t) = \dfrac{e^{bt} - e^{at}}{t(b - a)}; \quad t \ne 0.$ (2.19)

Remark 2.8. In particular, for a = 0 and b = 1, we have the uniform distribution defined on the interval [0, 1].

Remark 2.9. If the random variable X has a continuous distribution function F, then Y = F(X) has the uniform distribution over the interval [0, 1].

Remark 2.10. There is no unique mode but the distribution is symmetric about the point x = (a + b)/2, which is also the median.
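
Remark 2.9 (the probability integral transform) is easy to illustrate by simulation. The sketch below assumes, purely for illustration, that X is exponential with rate 2, so that F(x) = 1 - exp(-2x); the function names, seed, and sample size are arbitrary choices.

```python
import math
import random


def pit_sample(rate=2.0, size=20000, seed=0):
    """Draw X ~ Exp(rate) and return Y = F(X) = 1 - exp(-rate * X)."""
    rng = random.Random(seed)
    return [1.0 - math.exp(-rate * rng.expovariate(rate)) for _ in range(size)]


def empirical_cdf(values, u):
    """Fraction of values not exceeding u."""
    return sum(v <= u for v in values) / len(values)


ys = pit_sample()
# For a U(0, 1) variable the cdf at u is u itself.
for u in (0.25, 0.5, 0.75):
    print(u, empirical_cdf(ys, u))
```

The empirical cdf of the transformed draws should be close to the identity on [0, 1], up to Monte Carlo error.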


Normal Distribution 


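A random variable X is said to have a normal distribution if it has the pdf

$f(x \mid \theta, \sigma) = \dfrac{1}{\sigma \sqrt{2\pi}} \exp\left( -\dfrac{1}{2\sigma^2} (x - \theta)^2 \right), \quad -\infty < x < \infty, \; \theta \in (-\infty, \infty), \; \sigma > 0,$ (2.20)

where $\theta$ and $\sigma$ are location and scale parameters, respectively. We say X ~ N($\theta$, $\sigma^2$). This pdf has a single mode at $x = \theta$ and is symmetric about this point. Therefore, $\theta$ is also the median and the mean of the normal pdf.

The odd order central moments about $\theta$ are all zero, that is, $E[(X - \theta)^{2k-1}] = 0$ for k = 1, 2, .... However,

$\mu_{2k} = E((X - \theta)^{2k}) = \dfrac{2^k \sigma^{2k}}{\sqrt{\pi}} \Gamma\left( k + \tfrac{1}{2} \right), \quad k = 1, 2, \ldots$

The moment generating function of X ~ N($\theta$, $\sigma^2$) is

$M(t) = \exp\left( \theta t + \tfrac{1}{2} \sigma^2 t^2 \right)$ (2.21)

for all real values of t. The absolute moment of order k of a standard normal random variable Z is

$E|Z|^k = \dfrac{2^{k/2}}{\sqrt{\pi}} \Gamma\left( \dfrac{k + 1}{2} \right).$ (2.22)

Remark 2.11. For $Z = \dfrac{X - \theta}{\sigma}$, the normal pdf reduces to the standardized form

$f(z) = \dfrac{1}{\sqrt{2\pi}} \exp\left( -\dfrac{z^2}{2} \right), \quad -\infty < z < \infty.$ (2.23)

Remark 2.12. If $X_1, X_2, \ldots, X_n$ are independent random variables such that $X_k \sim N(\theta_k, \sigma_k^2)$; k = 1, 2, ..., n, then

$\sum_{k=1}^{n} X_k \sim N\left( \sum_{k=1}^{n} \theta_k, \sum_{k=1}^{n} \sigma_k^2 \right).$ (2.24)

Remark 2.13. If $X_1, X_2, \ldots, X_n$ is a random sample from N($\theta$, $\sigma^2$), then $\bar{X} \sim N\left( \theta, \dfrac{\sigma^2}{n} \right)$, where $\bar{X} = \dfrac{1}{n} \sum_{i=1}^{n} X_i$.

Generalized Inverse Normal Distribution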


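The generalised inverse normal distribution GIN($\alpha$, $\mu$, $\tau$) has the density

$f(x \mid \alpha, \mu, \tau) = k(\alpha, \mu, \tau) \, |x|^{-\alpha} \exp\left( -\dfrac{1}{2\tau^2} \left( \dfrac{1}{x} - \mu \right)^2 \right), \quad x \in \mathbb{R},$ (2.25)

with $\alpha > 0$, $\mu \in \mathbb{R}$ and $\tau > 0$,

where $(k(\alpha, \mu, \tau))^{-1} = \tau^{\alpha - 1} e^{-\mu^2 / 2\tau^2} \, 2^{(\alpha - 1)/2} \, \Gamma\left( \dfrac{\alpha - 1}{2} \right) {}_1F_1\left( \dfrac{\alpha - 1}{2}; \dfrac{1}{2}; \dfrac{\mu^2}{2\tau^2} \right)$ and ${}_1F_1$ is the confluent hypergeometric function. The mean of X is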


Lognormal Distribution 
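A random variable Y is said to have a lognormal distribution if X = log Y is normally distributed. Thus it has the pdf

$f(y \mid \theta, \sigma) = \dfrac{1}{\sigma y \sqrt{2\pi}} \exp\left( -\dfrac{1}{2\sigma^2} (\log y - \theta)^2 \right), \quad 0 < y < \infty,$ (2.26)

where $-\infty < \theta < \infty$ and $\sigma > 0$, with $\theta = E(\log Y)$ and $\sigma^2 = Var(\log Y)$. We shall say Y ~ Lognormal($\theta$, $\sigma^2$). The mean and variance of Y are given by

$E(Y) = \exp\left( \theta + \dfrac{\sigma^2}{2} \right)$ and $Var(Y) = \exp(2\theta + \sigma^2) (\exp(\sigma^2) - 1).$

The mode of the lognormal distribution is $\exp(\theta - \sigma^2)$ and the median is $\exp(\theta)$.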


Inverse Gaussian Distribution 


A random variable X is said to have an inverse Gaussian distribution if it has the pdf

$f(x \mid \mu, \lambda) = \sqrt{\dfrac{\lambda}{2\pi x^3}} \exp\left( -\dfrac{\lambda (x - \mu)^2}{2\mu^2 x} \right); \quad x > 0, \; \lambda > 0, \; \mu > 0,$ (2.27)

and we shall say X ~ Inverse Gaussian($\mu$, $\lambda$).
The parameter $\mu$ is a measure of location which is also the population mean, while $\lambda$ is the reciprocal measure of dispersion. The variance is given by

$Var(X) = \mu^3 / \lambda.$

The moment generating function of X is

$M(t) = \exp\left\{ \dfrac{\lambda}{\mu} \left[ 1 - \left( 1 - \dfrac{2\mu^2 t}{\lambda} \right)^{1/2} \right] \right\}, \quad t \le \dfrac{\lambda}{2\mu^2}.$ (2.28)

The mode is $-\dfrac{3\mu^2}{2\lambda} + \mu \left( 1 + \dfrac{9\mu^2}{4\lambda^2} \right)^{1/2}$.

Remark 2.14. Inverse Gaussian and normal distributions have an inverse relation between their cumulant generating functions. (If M(t) is the mgf and $\psi(t)$ is the cumulant generating function, then $\psi(t) = \log M(t)$.) Random variables X and Y are inverse variables if their cumulant generating functions $K_X(t)$ and $K_Y(t)$ satisfy the condition

$K_X(t) = a K(t),$
$K_Y(t) = b K^{-1}(t),$

for all t values common to the domain of both cumulant generating functions, where a and b are some constants and $K(K^{-1}(t)) = t$.
Pairs of distributions having cgfs $K_X(t)$ and $K_Y(t)$ are called inverse distributions.

It is interesting to note that the binomial and negative binomial, as well as the Poisson and gamma, are inverse distributions (see Chhikara and Folks, 1989).


Gamma Distribution 


A random variable X is said to have a gamma distribution if it has the pdf

$f(x \mid a, b) = \dfrac{b^a}{\Gamma(a)} x^{a-1} e^{-bx}; \quad 0 < x < \infty,$ (2.29)

where a, b > 0, and we shall say X ~ Gamma(a, b). When $a \ge 1$, the pdf has a single mode at (a - 1)/b. For 0 < a < 1, the pdf has no mode. As a increases to infinity, for any given value of b, the pdf approaches a normal form. However, for small values of a, the pdf has a long tail to the right. The kth moment about the origin is

$E(X^k) = \dfrac{\Gamma(a + k)}{b^k \Gamma(a)}, \quad k = 1, 2, \ldots$

In particular, E(X) = a/b and $Var(X) = a/b^2$.
The moment generating function of X is

$M(t) = E(e^{tX}) = \left( 1 - \dfrac{t}{b} \right)^{-a}, \quad t < b.$ (2.30)

Remark 2.15. For a = 1, we have the exponential distribution with parameter b. The pdf is

$f(x \mid b) = \begin{cases} b e^{-bx} & x > 0, \\ 0 & \text{otherwise.} \end{cases}$

We shall say X ~ Exp(b).
The moment generating function of the exponential distribution is

$M(t) = \left( 1 - \dfrac{t}{b} \right)^{-1}, \quad t < b.$

Remark 2.16. If $X_1, X_2, \ldots, X_n$ is a random sample from an exponential distribution with parameter b, then $S_n = \sum_{i=1}^{n} X_i$ is a Gamma(n, b) random variable.

$\chi^2$-distribution

A random variable X is said to have a $\chi^2$-distribution with n degrees of freedom if it has the pdf

$f(x \mid n) = \dfrac{e^{-x/2} x^{(n/2) - 1}}{\Gamma(n/2) \, 2^{n/2}}, \quad 0 < x < \infty.$ (2.31)

Note that Gamma(n/2, 1/2) is a $\chi^2$ with n df. The mean of the $\chi^2$ distribution is n and the variance is 2n.

Inverted-Gamma Distribution 


A random variable X is said to have an inverted-gamma distribution if it has the pdf

$f(x \mid a, b) = \dfrac{b^a}{\Gamma(a)} x^{-(a+1)} e^{-b/x}; \quad x > 0, \; a, b > 0,$ (2.32)

and we shall say X ~ Inverted-Gamma(a, b), with

$E(X) = \dfrac{b}{a - 1}$, provided a > 1,

and $Var(X) = \dfrac{b^2}{(a - 1)^2 (a - 2)}$, provided a > 2.

There is a unique mode at b/(a + 1).

Remark 2.17. If X ~ Gamma(a, b), then Y = 1/X has an Inverted-Gamma(a, b) density. In particular, if a = n/2 and b = 1/2, then Y ~ Inverted chi-square with n df.

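Remark 2.18. If we make the transformation $Y^2 = 1/X$, where X ~ Gamma(a, b), then

$f(y \mid a, b) = \dfrac{2 b^a}{\Gamma(a)} y^{-(2a+1)} e^{-b/y^2}, \quad 0 < y < \infty.$ (2.33)

In particular, if we let $y = \sigma$, $a = \dfrac{n}{2}$ and $b = \dfrac{n s^2}{2}$, we have

$f(\sigma \mid n, s) = \dfrac{2}{\Gamma(n/2)} \left( \dfrac{n s^2}{2} \right)^{n/2} \dfrac{\exp(-n s^2 / 2\sigma^2)}{\sigma^{n+1}},$ (2.34)

where n, s > 0. It has a single mode at $s \left( \dfrac{n}{n + 1} \right)^{1/2}$. The kth moment is

$E(\sigma^k) = \left( \dfrac{n s^2}{2} \right)^{k/2} \dfrac{\Gamma((n - k)/2)}{\Gamma(n/2)}, \quad k < n.$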

Chi-square and Related Distributions 


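The sampling distribution of the quantity

$\chi_n^2 = \sum_{i=1}^{n} \dfrac{(X_i - \theta)^2}{\sigma^2} = \dfrac{n S^2}{\sigma^2}$

is

$f(\chi_n^2) = \left[ \Gamma\left( \dfrac{n}{2} \right) 2^{n/2} \right]^{-1} (\chi_n^2)^{n/2 - 1} \exp\left( -\dfrac{1}{2} \chi_n^2 \right), \quad \chi_n^2 > 0.$ (2.35)

The sampling distribution of $\chi_n = \sqrt{n} \, \dfrac{S}{\sigma}$ (the positive square root of $\chi_n^2$) is

$f(\chi_n) = \left[ \Gamma\left( \dfrac{n}{2} \right) 2^{n/2 - 1} \right]^{-1} \chi_n^{n-1} \exp\left( -\dfrac{1}{2} \chi_n^2 \right), \quad \chi_n > 0.$ (2.36)

The sampling distribution of $\log \chi_n^2 = \log n + \log S^2 - \log \sigma^2$ is

$f(\log \chi_n^2) = \left[ \Gamma\left( \dfrac{n}{2} \right) 2^{n/2} \right]^{-1} (\chi_n^2)^{n/2} \exp\left( -\dfrac{1}{2} \chi_n^2 \right)$ (2.37)
$= \left[ \Gamma\left( \dfrac{n}{2} \right) 2^{n/2} \right]^{-1} \exp\left( \dfrac{n}{2} \log \chi_n^2 - \dfrac{1}{2} \exp(\log \chi_n^2) \right), \quad -\infty < \log \chi_n^2 < \infty.$

On making the transformations $\chi_n^{-2} = \dfrac{1}{\chi_n^2}$ and $\chi_n^{-1} = \dfrac{1}{\chi_n}$, we get the distributions of inverted chi-square and inverted-chi as

$f(\chi_n^{-2}) = \left[ \Gamma\left( \dfrac{n}{2} \right) 2^{n/2} \right]^{-1} (\chi_n^{-2})^{-(n/2 + 1)} \exp\left( -\dfrac{1}{2 \chi_n^{-2}} \right), \quad \chi_n^{-2} > 0,$ (2.38)

$f(\chi_n^{-1}) = \left[ \Gamma\left( \dfrac{n}{2} \right) 2^{n/2 - 1} \right]^{-1} (\chi_n^{-1})^{-(n+1)} \exp\left( -\dfrac{1}{2 (\chi_n^{-1})^2} \right), \quad \chi_n^{-1} > 0.$ (2.39)
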
Beta Distribution 
A random variable X is said to have a Beta distribution if it has the following pdf

$f(x \mid a, b, c) = \dfrac{1}{c \, B(a, b)} \left( \dfrac{x}{c} \right)^{a-1} \left( 1 - \dfrac{x}{c} \right)^{b-1}; \quad 0 < x < c,$ (2.40)

where a, b, c > 0 and B(a, b) denotes the beta function. We can obtain the standardized beta pdf by making the transformation z = x/c. We obtain

$f(z \mid a, b) = \dfrac{1}{B(a, b)} z^{a-1} (1 - z)^{b-1}; \quad 0 < z < 1,$ (2.41)

and we shall say X ~ Beta(a, b).

For a, b > 1, the mode of the standardized beta distribution is $\dfrac{a - 1}{a + b - 2}$. The moments of the standardized beta pdf are given by

$E(z^k) = \dfrac{B(k + a, b)}{B(a, b)} = \dfrac{a (a + 1)(a + 2) \cdots (a + k - 1)}{(a + b)(a + b + 1) \cdots (a + b + k - 1)}, \quad k = 1, 2, \ldots$ (2.42)

In particular,

$E(z) = \dfrac{a}{a + b}$ and $Var(z) = \dfrac{ab}{(a + b)^2 (a + b + 1)}.$ (2.43)

If b = a, the pdf of the standardized beta distribution is symmetric; the skewness is positive when b > a and negative if b < a. The mgf of X is

$M(t) = E(e^{tX}) = \sum_{j=0}^{\infty} \dfrac{t^j}{\Gamma(j + 1)} \cdot \dfrac{\Gamma(a + b) \Gamma(a + j)}{\Gamma(a + b + j) \Gamma(a)}.$ (2.44)

Remark 2.19. The standardized beta distribution reduces to the uniform distribution on the interval (0, 1) for a = b = 1.

Remark 2.20. If X is a beta random variable with parameters a and b then Y=1-X is also a beta random 
variable with parameters b and a. 

Remark 2.21. If $X_1, X_2, \ldots, X_n$ is a random sample from the uniform distribution on the interval [0, 1] and $X_{(k)}$ is the kth order statistic, then the rv $X_{(k)}$ has a Beta(k, n - k + 1) distribution.


Remark 2.22. If we let z = 1/(1 + u) in (2.41), then

$f(u \mid a, b) = \dfrac{1}{B(a, b)} \cdot \dfrac{u^{b-1}}{(1 + u)^{a+b}}, \quad 0 < u < \infty,$ (2.45)

with a, b > 0. Its kth moment is given by

$E(u^k) = \dfrac{B(b + k, a - k)}{B(a, b)}; \quad k < a.$ (2.46)

This distribution is known as the inverted-beta distribution and we shall say X ~ Inverted-Beta(a, b).

The inverted beta distribution has a single mode at $\dfrac{b - 1}{a + 1}$, provided b > 1.

Remark 2.23. If we let u = y/c with c > 0, we have

$f(y \mid a, b, c) = \dfrac{1}{c \, B(a, b)} \left( \dfrac{y}{c} \right)^{b-1} \dfrac{1}{(1 + y/c)^{a+b}}, \quad 0 \le y < \infty; \; a, b, c > 0$
$= \dfrac{c^a}{B(a, b)} \cdot \dfrac{y^{b-1}}{(c + y)^{a+b}},$ (2.47)

which is a 3-parameter inverted beta distribution denoted by Inverted-Beta(b, a, c). It may be noted that Fisher's F and Student's t pdfs are its special cases.


Pareto Distribution 


A random variable X is said to have a Pareto distribution if it has the pdf

$f(x \mid a, b) = b a^b x^{-(b+1)}; \quad 0 < a < x,$ (2.48)

and we shall say X ~ Pareto(a, b).
The mean and variance are

$E(X) = \dfrac{ab}{b - 1}$, provided b > 1,

and $Var(X) = \dfrac{b a^2}{(b - 1)^2 (b - 2)}$, provided b > 2.

The median is $2^{1/b} a$ and the mode is at a.

t-distribution


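A random variable X has a t-distribution if it has the pdf

$f(x \mid \theta, h, \nu) = \dfrac{\Gamma\left( \frac{\nu + 1}{2} \right)}{\Gamma\left( \frac{1}{2} \right) \Gamma\left( \frac{\nu}{2} \right)} \left( \dfrac{h}{\nu} \right)^{1/2} \left[ 1 + \dfrac{h}{\nu} (x - \theta)^2 \right]^{-(\nu + 1)/2}, \quad -\infty < x < \infty,$ (2.49)

where $-\infty < \theta < \infty$, $0 < h < \infty$, $0 < \nu$. This pdf has three parameters $\theta$, h, and $\nu$.

The t-distribution has a single mode at $x = \theta$ and is symmetric about the modal value $x = \theta$. Further, $x = \theta$ is the median and the mean (for $\nu > 1$). The (2k - 1)th central moment is

$\mu_{2k-1} = E(X - \theta)^{2k-1} = 0$, k = 1, 2, ... and $\nu > 2k - 1$,

but the even order central moments are

$\mu_{2k} = E(X - \theta)^{2k} = \dfrac{\Gamma\left( k + \frac{1}{2} \right) \Gamma\left( \frac{\nu}{2} - k \right)}{\Gamma\left( \frac{1}{2} \right) \Gamma\left( \frac{\nu}{2} \right)} \left( \dfrac{\nu}{h} \right)^k, \quad k = 1, 2, \ldots, \; \nu > 2k.$

In particular, $E(X) = \theta$ exists only if $\nu > 1$,

$Var(X) = \mu_2 = \dfrac{\nu}{\nu - 2} \left( \dfrac{1}{h} \right)$ for $\nu > 2$,

and $\mu_4 = \dfrac{3}{(\nu - 2)(\nu - 4)} \left( \dfrac{\nu}{h} \right)^2$, for $\nu > 4$.

The kurtosis is given by

$\gamma_2 = \dfrac{\mu_4}{\mu_2^2} - 3 = \dfrac{6}{\nu - 4}$, provided $\nu > 4$.

As $\nu \to \infty$, the Student's t pdf assumes the shape of a normal pdf with mean $\theta$ and variance 1/h.
Remark 2.24. The random variable X having the pdf $f(x \mid \theta, h, \nu)$ is said to have a t-distribution with $\nu$ df, location parameter $\theta$, and precision h. It is important to note that the parameter h (the precision of the t-distribution) is not the reciprocal of the variance of the distribution.

Remark 2.25. Note that for $\nu = 1$, the t-distribution is identical to the Cauchy distribution, for which the first and higher order moments do not exist.

Remark 2.26. In particular, if $\theta = 0$ and h = 1, the pdf is called the standardized t-distribution with $\nu$ df. Its pdf is given by

$f(x \mid \nu) = \dfrac{\Gamma\left( \frac{\nu + 1}{2} \right)}{\sqrt{\pi \nu} \, \Gamma\left( \frac{\nu}{2} \right)} \left( 1 + \dfrac{x^2}{\nu} \right)^{-(\nu + 1)/2}.$ (2.50)

Remark 2.27. If X has a standard normal distribution and Y is independently distributed as a $\chi^2$-distribution with n df, then $\dfrac{X}{\sqrt{Y/n}}$ has a Student's t-distribution with n df.

Remark 2.28. If $X_1, X_2, \ldots, X_n$ is a random sample from N($\mu$, $\sigma^2$), then

$\dfrac{\sqrt{n} (\bar{X} - \mu)}{\sqrt{\dfrac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2}}$

has a Student's t-distribution with (n - 1) df.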
F distribution 


A random variable X is said to have an F distribution with $(n_1, n_2)$ degrees of freedom if it has the pdf

$f(x \mid n_1, n_2) = \left[ B\left( \dfrac{n_1}{2}, \dfrac{n_2}{2} \right) \right]^{-1} \left( \dfrac{n_1}{n_2} \right)^{n_1/2} x^{n_1/2 - 1} \left( 1 + \dfrac{n_1 x}{n_2} \right)^{-(n_1 + n_2)/2}; \quad 0 < x < \infty,$ (2.51)

where $n_1, n_2 > 0$. If $n_1 > 2$, it has a single mode at $\dfrac{n_2}{n_1} \left( \dfrac{n_1 - 2}{n_2 + 2} \right)$.

Remark 2.29. It is a special case of the inverted-beta pdf for $a = n_2/2$, $b = n_1/2$ and $c = n_2/n_1$.
Remark 2.30. For $n_1 = 1$ and $t^2 = x$, the F pdf reduces to the standardized t pdf with $n_2$ df.
Remark 2.31. If $X_1$ and $X_2$ are independent random variables with $\chi^2$ pdfs having $n_1$ and $n_2$ df, respectively, then $\dfrac{X_1 / n_1}{X_2 / n_2}$ has an F pdf with $n_1$ and $n_2$ df, provided $n_1, n_2 > 0$.


2.3. MIXTURE DISTRIBUTIONS 


Let the random variable X have a pdf $f(x \mid \theta)$ and suppose that the parameter $\Theta$ is a discrete random variable taking values $\theta_1, \theta_2, \ldots, \theta_k$. If the distribution of $\Theta$ is such that $P(\Theta = \theta_i) = p_i$, then the unconditional (marginal or compound) distribution of X is $m(x) = \sum_{i=1}^{k} p_i f(x \mid \theta_i)$. This is called a mixture of the distributions $f(x \mid \theta_i)$ with weights $p_i$, i = 1, 2, ..., k. The above definition may be extended to the case of infinite k.
Remark 2.32. It can be generalized to the case when the parameter $\theta$ is an absolutely continuous random variable having pdf $g(\theta)$. We shall then have a continuous mixture of densities $f(x \mid \theta)$ with weight function $g(\theta)$. In this case, the unconditional distribution of X is

$m(x) = \int f(x \mid \theta) g(\theta) \, d\theta.$ (2.52)

Remark 2.33. The marginal distribution m(x) appears in the denominator of the posterior distribution $g(\theta \mid x)$ as a normalizing constant. It is also known as the prior predictive distribution and is used in empirical Bayes estimation.


Compound Poisson 


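Suppose X ~ Pois($\theta$) and $\theta$ ~ Gamma(a, b). The compound Poisson distribution is the mixture of Poisson with the gamma distribution as weight function. Its pmf is

$m(x \mid a, b) = \int_0^{\infty} \dfrac{e^{-\theta} \theta^x}{x!} \cdot \dfrac{b^a e^{-b\theta} \theta^{a-1}}{\Gamma(a)} \, d\theta$
$= \dfrac{b^a}{x! \, \Gamma(a)} \cdot \dfrac{\Gamma(a + x)}{(b + 1)^{a+x}} = \dbinom{a + x - 1}{a - 1} \left( \dfrac{b}{b + 1} \right)^a \left( \dfrac{1}{b + 1} \right)^x, \quad x = 0, 1, 2, \ldots,$ (2.53)

which is a negative binomial distribution with parameters a and b/(b + 1).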
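
A numerical sanity check of (2.53) is sketched below: the mixture integral is approximated by a midpoint rule and compared with the negative binomial pmf (2.8) evaluated at n = a and success probability b/(b + 1). The function names, the truncation point `upper`, and the parameter values in the usage lines are illustrative assumptions.

```python
import math


def nbin_pmf(x, n, theta):
    """Negative binomial pmf (2.8), allowing real n > 0 via log-gamma."""
    log_coef = math.lgamma(n + x) - math.lgamma(n) - math.lgamma(x + 1)
    return math.exp(log_coef + n * math.log(theta) + x * math.log(1.0 - theta))


def poisson_gamma_mixture(x, a, b, upper=60.0, steps=60000):
    """Midpoint-rule approximation of the integral of Pois(x | theta) * Gamma(theta | a, b)."""
    width = upper / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * width
        log_pois = -theta + x * math.log(theta) - math.lgamma(x + 1)
        log_gamma = (a * math.log(b) - math.lgamma(a)
                     + (a - 1.0) * math.log(theta) - b * theta)
        total += math.exp(log_pois + log_gamma)
    return total * width


a, b = 2.5, 1.5
for x in range(4):
    print(x, poisson_gamma_mixture(x, a, b), nbin_pmf(x, a, b / (b + 1.0)))
```

Working on the log scale avoids overflow in the factorial and gamma terms; the integral is truncated at `upper`, beyond which the integrand is negligible for these parameter values.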
Compound Binomial 


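Suppose X ~ Bin(n, $\theta$), $\theta$ known, and n ~ Pois($\lambda$). The compound binomial distribution is the mixture of binomial with the Poisson distribution as weight function. Its pmf is

$m(x \mid \theta, \lambda) = \sum_{n=x}^{\infty} \dbinom{n}{x} \theta^x (1 - \theta)^{n-x} \dfrac{e^{-\lambda} \lambda^n}{n!}$
$= \dfrac{e^{-\lambda} (\lambda\theta)^x}{x!} \sum_{n=x}^{\infty} \dfrac{(\lambda(1 - \theta))^{n-x}}{(n - x)!}$
$= \dfrac{e^{-\lambda\theta} (\lambda\theta)^x}{x!}, \quad x = 0, 1, 2, \ldots,$ (2.54)

which is a Poisson distribution with parameter $\lambda\theta$.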
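
The identity (2.54) can be verified by truncating the sum over n. The sketch below is a minimal check; the function names, the truncation point `n_max`, and the parameter values are illustrative assumptions.

```python
import math


def poisson_pmf(x, rate):
    """Poisson pmf (2.13)."""
    return math.exp(-rate) * rate ** x / math.factorial(x)


def binomial_pmf(x, n, theta):
    """Binomial pmf (2.4)."""
    return math.comb(n, x) * theta ** x * (1.0 - theta) ** (n - x)


def compound_binomial_pmf(x, theta, lam, n_max=150):
    """Truncated version of the sum over n in (2.54)."""
    return sum(binomial_pmf(x, n, theta) * poisson_pmf(n, lam)
               for n in range(x, n_max + 1))


theta, lam = 0.3, 4.0
for x in range(4):
    print(x, compound_binomial_pmf(x, theta, lam), poisson_pmf(x, lam * theta))
```

For moderate values of lambda the Poisson weights beyond `n_max` are far below floating-point precision, so the truncated sum matches the Poisson pmf with mean lambda * theta.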
Hypergeometric-Binomial 


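Suppose X ~ Hypergeometric(N, Y, n) and Y ~ Bin(N, $\theta$), $\theta$ known. The marginal distribution of X is the mixture of hypergeometric with binomial as weight function and is given by

$m(x \mid \theta) = \sum_{y} g(y \mid N, \theta) f(x \mid y)$
$= \sum_{y} \dfrac{\dbinom{y}{x} \dbinom{N - y}{n - x}}{\dbinom{N}{n}} \dbinom{N}{y} \theta^y (1 - \theta)^{N - y}$
$= \dbinom{n}{x} \sum_{y=x}^{N - n + x} \dbinom{N - n}{y - x} \theta^y (1 - \theta)^{N - y}$
$= \dbinom{n}{x} \theta^x (1 - \theta)^{n - x}, \quad x = 0, 1, 2, \ldots, n.$ (2.55)

It is a binomial distribution with parameters n and $\theta$.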
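
Since every sum in (2.55) is finite, the result can be confirmed exactly. In the sketch below, Hypergeometric(N, y, n) is read as: a sample of size n drawn without replacement from N items, y of which are successes (this reading of the parametrization is an assumption), and the function names are illustrative.

```python
import math


def hypergeom_pmf(x, N, y, n):
    """P(x successes in a sample of n drawn without replacement from N items, y of them successes)."""
    if x < 0 or x > y or n - x > N - y:
        return 0.0
    return math.comb(y, x) * math.comb(N - y, n - x) / math.comb(N, n)


def binom_pmf(x, n, theta):
    """Binomial pmf (2.4)."""
    return math.comb(n, x) * theta ** x * (1.0 - theta) ** (n - x)


def mixture_pmf(x, N, n, theta):
    """Sum over y of Bin(y | N, theta) * Hypergeometric(x | N, y, n), as in (2.55)."""
    return sum(binom_pmf(y, N, theta) * hypergeom_pmf(x, N, y, n)
               for y in range(N + 1))


N, n, theta = 20, 6, 0.35
for x in range(n + 1):
    print(x, mixture_pmf(x, N, n, theta), binom_pmf(x, n, theta))
```

The population size N drops out of the marginal, which is why the two printed columns agree for any N >= n.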


Normal-Normal 


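Suppose X ~ N($\theta$, $\sigma^2$), $\sigma^2$ known, and $\theta$ ~ N($\mu$, $\tau^2$); then the marginal density of X is given by

$m(x \mid \mu, \tau) \propto \int_{-\infty}^{\infty} \exp\left( -\dfrac{1}{2\sigma^2} (x - \theta)^2 \right) \exp\left( -\dfrac{1}{2\tau^2} (\theta - \mu)^2 \right) d\theta$
$\propto \exp\left( -\dfrac{(x - \mu)^2}{2(\sigma^2 + \tau^2)} \right) \int_{-\infty}^{\infty} \exp\left( -\dfrac{\sigma^2 + \tau^2}{2\sigma^2\tau^2} \left( \theta - \dfrac{\tau^2 x + \sigma^2 \mu}{\sigma^2 + \tau^2} \right)^2 \right) d\theta$
$\propto \exp\left( -\dfrac{1}{2(\sigma^2 + \tau^2)} (x - \mu)^2 \right).$ (2.56)

Therefore $m(x \mid \mu, \tau)$ is N($\mu$, $\sigma^2 + \tau^2$).
Remark 2.34. We can obtain the same result using iterated expectations as follows:

$M(t) = E(e^{tX}) = E[E(e^{tX} \mid \theta)] = E[\exp(t\theta + t^2\sigma^2/2)]$
$= \exp(t^2\sigma^2/2) E(e^{t\theta}) = \exp(t^2\sigma^2/2) \exp(t\mu + t^2\tau^2/2) = \exp(t\mu + t^2(\tau^2 + \sigma^2)/2).$

Hence, by the uniqueness theorem of mgfs, the marginal distribution of X is N($\mu$, $\tau^2 + \sigma^2$).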
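
The two-stage structure behind (2.56) also lends itself to a simulation check: draw the mean from its prior, then draw X given that mean, and compare the sample mean and variance of X with the stated marginal. The parameter values, seed, and function name below are arbitrary illustrative choices.

```python
import random
import statistics


def simulate_marginal(mu=1.0, tau=2.0, sigma=1.5, size=50000, seed=1):
    """Draw theta ~ N(mu, tau^2), then X | theta ~ N(theta, sigma^2); return the X draws."""
    rng = random.Random(seed)
    return [rng.gauss(rng.gauss(mu, tau), sigma) for _ in range(size)]


xs = simulate_marginal()
# The marginal is N(mu, sigma^2 + tau^2), so the targets are 1.0 and 1.5**2 + 2.0**2 = 6.25.
print(statistics.fmean(xs), statistics.variance(xs))
```

The printed values should be close to 1.0 and 6.25, up to Monte Carlo error; matching two moments does not by itself prove normality, which is what the mgf argument in Remark 2.34 supplies.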


Gamma-Gamma 


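Suppose X ~ Gamma(m, $\theta$), m known, and $\theta$ ~ Gamma(n, $\lambda$); then the marginal density of X is given by

$m(x \mid n, \lambda) = \int_0^{\infty} f(x \mid \theta) g(\theta \mid \lambda) \, d\theta$
$= \dfrac{x^{m-1} \lambda^n}{\Gamma(m) \Gamma(n)} \int_0^{\infty} e^{-\theta(x + \lambda)} \theta^{m+n-1} \, d\theta$
$= \dfrac{1}{B(m, n)} \cdot \dfrac{1}{\lambda} \cdot \dfrac{(x/\lambda)^{m-1}}{(1 + x/\lambda)^{m+n}}$
$= \dfrac{\lambda^n}{B(m, n)} \cdot \dfrac{x^{m-1}}{(\lambda + x)^{m+n}}, \quad x > 0,$ (2.57)

which is Inverted-Beta(m, n, $\lambda$).
Remark 2.35. If we replace m and n by m/2 and n/2, respectively, and write $\lambda = n/m$, then it reduces to the pdf of Fisher's F-statistic

$m(x \mid m, n) = \dfrac{(m/n)^{m/2}}{B(m/2, n/2)} \cdot \dfrac{x^{m/2 - 1}}{(1 + mx/n)^{(m+n)/2}}; \quad x > 0,$ (2.58)

with (m, n) df.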
Normal-Gamma 


Suppose X ~ N($\mu$, 1/$\theta$) and $\theta$ ~ Gamma(a, b); then the marginal density of X is given by

$m(x \mid a, b) = \int_0^{\infty} f(x \mid \theta) g(\theta \mid a, b) \, d\theta \propto \left( 1 + \dfrac{(x - \mu)^2}{2b} \right)^{-(2a+1)/2}.$ (2.59)

This is the kernel of a 3-parameter t-density (2.49) with 2a df, location parameter $\mu$, and precision a/b.
Binomial-Beta 


It is a mixture of binomial($x \mid n, \theta$) and beta($\theta \mid a, b$) distributions. The marginal pmf of X is given by

$m(x \mid a, b) = \int_0^1 f(x \mid \theta) g(\theta \mid a, b) \, d\theta = \dbinom{n}{x} \dfrac{B(a + x, b + n - x)}{B(a, b)}; \quad x = 0, 1, \ldots, n.$ (2.60)

The mean and variance are given by

$E(X) = \dfrac{na}{a + b}$ and $Var(X) = \dfrac{nab}{(a + b)^2} \left( \dfrac{a + b + n}{a + b + 1} \right).$

The mode of the distribution is $[x_m]$, where $x_m = \dfrac{(n + 1)(a - 1)}{a + b - 2}$ and $[\cdot]$ is the greatest integer function. If $x_m$ is an integer, then both $x_m$ and $x_m - 1$ are modes. (Refer to Aitchison and Dunsmore (1975), page 48.)
The moment generating function of the binomial-beta distribution is

$M(t) = \sum_{x=0}^{n} e^{tx} \dbinom{n}{x} \dfrac{B(a + x, b + n - x)}{B(a, b)}.$ (2.61)

Remark 2.36. In particular, if $\theta$ has a uniform distribution over the interval [0, 1], that is, a = b = 1, we obtain the marginal distribution of X as a discrete uniform distribution assigning mass 1/(n + 1) to each value of X belonging to the set 0, 1, 2, ..., n.
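
The pmf (2.60) is straightforward to evaluate with log-gamma functions, which makes Remark 2.36 and the stated mean easy to confirm numerically. The helper names and the parameter values in the usage lines below are illustrative assumptions.

```python
import math


def log_beta(a, b):
    """log B(a, b) computed from log-gamma values."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(x, n, a, b):
    """Marginal pmf (2.60): C(n, x) * B(a + x, b + n - x) / B(a, b)."""
    return math.comb(n, x) * math.exp(log_beta(a + x, b + n - x) - log_beta(a, b))


# With a = b = 1 every value 0, 1, ..., n receives mass 1 / (n + 1).
n = 10
print([round(beta_binomial_pmf(x, n, 1.0, 1.0), 6) for x in range(n + 1)])

# Mean for general (a, b) compared with n * a / (a + b).
a, b = 2.5, 4.0
mean = sum(x * beta_binomial_pmf(x, n, a, b) for x in range(n + 1))
print(mean, n * a / (a + b))
```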


Negative Binomial-Beta 


It is a mixture of negative binomial(x | θ, r) and beta(θ | a, b) distributions. The marginal
distribution of X has the pmf

$$ m(x \mid a, b) = \binom{r+x-1}{r-1} \frac{1}{B(a,b)} \int_0^1 \theta^{a+r-1}(1-\theta)^{x+b-1}\, d\theta
 = \binom{r+x-1}{r-1} \frac{B(a+r,\, x+b)}{B(a,b)}; \qquad x = 0, 1, \ldots \tag{2.62} $$

with mean and variance

$$ E(X) = \frac{br}{a-1} \quad \text{and} \quad
 Var(X) = \frac{br}{a-1} \left[ \frac{a+b+r-1}{a-2} + \frac{br}{(a-1)(a-2)} \right], $$

provided a > 2.
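
As a sanity check on the negative binomial-beta moments, the pmf (2.62) can be summed over a long truncated support. This is a sketch only: the values r = 3, a = 6, b = 2 and the truncation point are assumptions chosen so that the neglected tail is negligible.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """log B(a, b) via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def nb_beta_pmf(x, r, a, b):
    """Pmf (2.62): C(r + x - 1, r - 1) * B(a + r, x + b) / B(a, b)."""
    return comb(r + x - 1, r - 1) * exp(log_beta(a + r, x + b) - log_beta(a, b))

# Hypothetical parameter values; a > 2 so that the variance exists.
r, a, b = 3, 6.0, 2.0
support = range(5000)  # truncation of the infinite support (assumption)

total = sum(nb_beta_pmf(x, r, a, b) for x in support)
mean = sum(x * nb_beta_pmf(x, r, a, b) for x in support)
second = sum(x * x * nb_beta_pmf(x, r, a, b) for x in support)
var = second - mean ** 2

# Closed forms quoted in the text.
mean_formula = b * r / (a - 1)
var_formula = (b * r / (a - 1)) * ((a + b + r - 1) / (a - 2)
                                   + b * r / ((a - 1) * (a - 2)))
```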
Beta-Pascal 


The beta-Pascal normalised probability mass function is defined by

$$ m(n \mid r', n', r) = \int_0^1 f(n \mid \theta, r)\, g(\theta \mid r', n')\, d\theta
 = \binom{n-1}{r-1} \int_0^1 \theta^{r}(1-\theta)^{n-r}\, \frac{\theta^{r'-1}(1-\theta)^{n'-r'-1}}{B(r',\, n'-r')}\, d\theta $$

$$ = \binom{n-1}{r-1} \frac{B(r+r',\, n+n'-(r+r'))}{B(r',\, n'-r')}; \qquad r = 1, 2, \ldots;\ n \ge r;\ n' > r' > 0. \tag{2.63} $$

The first two moments are

$$ E(n) = \frac{r(n'-1)}{r'-1} \quad \text{and} \quad
 Var(n) = \frac{r(r+r'-1)(n'-1)(n'-r')}{(r'-1)^2 (r'-2)}. $$


2.4 MULTIVARIATE DISTRIBUTIONS 


Multivariate Normal Distribution 


A k-dimensional random vector X = (X_1, X_2, ..., X_k)' has a non-singular multivariate normal
distribution with mean vector θ and covariance matrix V if X has the pdf

$$ f(x \mid \theta, V) = (2\pi)^{-k/2}\, |V|^{-1/2} \exp\left[-\tfrac{1}{2}(x-\theta)' V^{-1} (x-\theta)\right], \qquad x \in R^{k} \tag{2.64} $$

where V is a k×k symmetric positive definite matrix. We shall denote it by X ~ MVN(θ, V).


Properties: 


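(1) Suppose A is a given m×k matrix and we consider the linear transformation
Y = a + AX; then Y ~ MVN(a + Aθ, AVA').

(2) Suppose that the k-dimensional random vector X, the mean vector θ, and the covariance matrix V are
partitioned as

$$ X = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix}, \qquad
 \theta = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}, \qquad
 V = \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix}, $$

where the subvectors are $\mathbf{X}_1 = (X_1, \ldots, X_{k_1})'$ and $\mathbf{X}_2 = (X_{k_1+1}, \ldots, X_k)'$,
k = k_1 + k_2, and θ_1, θ_2 are the corresponding mean vectors. The marginal pdf of $\mathbf{X}_1$ is
MVN(θ_1, V_11) and the marginal pdf of $\mathbf{X}_2$ is MVN(θ_2, V_22).
The conditional distribution of $\mathbf{X}_1 \mid \mathbf{X}_2$ is MVN(ν_1, V_11 - V_12 V_22^{-1} V_21), where

$$ \nu_1 = \theta_1 + V_{12} V_{22}^{-1} (\mathbf{X}_2 - \theta_2). $$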


Remark 2.37. The mgf of a MVN(θ, V) is

$$ M(t) = \exp\left[t'\theta + \tfrac{1}{2}\, t' V t\right] \tag{2.65} $$

Remark 2.38. The precision matrix W of a non-singular multivariate normal distribution is defined to
be the inverse of the covariance matrix, that is, W = V^{-1}.
Remark 2.39. The pdf of MVN(θ, W), where W is a precision matrix, may be written as

$$ f(x \mid \theta, W) = (2\pi)^{-k/2}\, |W|^{1/2} \exp\left[-\tfrac{1}{2}(x-\theta)' W (x-\theta)\right], \qquad x \in R^{k} \tag{2.66} $$


Remark 2.40. If C is a k×k non-singular symmetric matrix such that C'V^{-1}C is the identity matrix of
order k, and we employ the transformation (X - θ) = CZ, then E(Z) = 0 and E(ZZ') = I. The multivariate
pdf of Z is referred to as the standardized multivariate normal distribution.


Remark 2.41. If V^{-1} is singular, then the distribution is improper and does not integrate to one.
Wishart Distribution 


The Wishart distribution is the multivariate generalization of the univariate gamma distribution 
and is used as the natural conjugate prior for the precision matrix of a multivariate normal distribution. 


A k×k symmetric positive definite random matrix W is said to have a non-singular k-dimensional
Wishart distribution with positive definite scale matrix T and n degrees of freedom, k ≤ n, if the joint
distribution of the elements of W is continuous with density function

$$ f(W \mid n, T) = \begin{cases}
 c\, \dfrac{|W|^{(n-k-1)/2}}{|T|^{n/2}} \exp\left[-\tfrac{1}{2}\, tr(T^{-1}W)\right] & \text{if } W>0,\ T>0 \\[2mm]
 0 & \text{otherwise.} \end{cases} $$

The normalizing constant c is defined by

$$ c^{-1} = 2^{nk/2}\, \pi^{k(k-1)/4} \prod_{j=1}^{k} \Gamma\left(\frac{n+1-j}{2}\right). $$

We shall denote it by W ~ Wishart(n, T).
Properties:

(1) If X_1, X_2, ..., X_n are k×1 iid MVN(0, T) random vectors and X = (X_1, X_2, ..., X_n)
is a k×n matrix, k ≤ n, then W = XX' = Σ_{i=1}^{n} X_i X_i' ~ Wishart(n, T). Let W = (w_ij) and
T = (t_ij); i, j = 1, 2, ..., k. Then

$$ E(w_{ij}) = n\, t_{ij}, $$

$$ Var(w_{ij}) = n\,(t_{ij}^2 + t_{ii}t_{jj}), $$

$$ Cov(w_{ij}, w_{rs}) = n\,(t_{ir}t_{js} + t_{is}t_{jr}), $$

are the means, variances and covariances of the components of a Wishart matrix W.
(2) Partition W and T as

$$ W = \begin{pmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{pmatrix}, \qquad
 T = \begin{pmatrix} T_{11} & T_{12} \\ T_{21} & T_{22} \end{pmatrix}, $$

where W_11 and T_11 are m×m matrices; then
W_11 ~ Wishart(n, T_11)
and
W_{11.2} = W_11 - W_12 W_22^{-1} W_21 ~ Wishart(n-k+m, T_{11.2}),
where T_{11.2} = T_11 - T_12 T_22^{-1} T_21.
(3) W_22 and W_{11.2} are independent.
(4) Suppose A is an m×k constant matrix with m ≤ k. Then
Z = AWA' ~ Wishart(n, ATA').


Remark 2.42. A random k×k matrix W has a Wishart(n, P) distribution with n df (n > k-1) and
symmetric positive definite precision matrix P, if its pdf is

$$ f(W \mid n, P) = C\, |P|^{n/2}\, |W|^{(n-k-1)/2} \exp\left[-\tfrac{1}{2}\, tr(PW)\right] \tag{2.67} $$


Inverted-Wishart Distribution 


The inverted-Wishart distribution is the multivariate generalization of the univariate inverted- 
gamma distribution and is used as the natural conjugate prior for the covariance matrix of a multivariate 
normal distribution. 


A k×k symmetric positive definite random matrix V is said to have a non-singular k-dimensional
inverted-Wishart distribution with positive definite scale matrix G and n degrees of freedom, if the pdf
is

$$ f(V \mid n, G) = \begin{cases}
 c_1\, \dfrac{|G|^{(n-k-1)/2}}{|V|^{n/2}} \exp\left[-\tfrac{1}{2}\, tr(V^{-1}G)\right] & \text{if } V>0,\ G>0,\ 2k<n \\[2mm]
 0 & \text{otherwise.} \end{cases} $$

The normalizing constant c_1 is defined by

$$ c_1^{-1} = 2^{(n-k-1)k/2}\, \pi^{k(k-1)/4} \prod_{j=1}^{k} \Gamma\left(\frac{n-k-j}{2}\right). $$

We shall denote it by V ~ Inverted-Wishart(n, G).


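Properties:
(1) If W ~ Wishart(n, G) and V = W^{-1}, then V ~ Inverted-Wishart(n+k+1, G^{-1}).

(2) Let

$$ V = \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix} \quad \text{and} \quad
 G = \begin{pmatrix} G_{11} & G_{12} \\ G_{21} & G_{22} \end{pmatrix}, $$

where V_11 and G_11 are m×m matrices; then for m < k,
V_11 ~ Inverted-Wishart(n-2k+2m, G_11).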
Normal-Gamma Distribution 


Normal-gamma distribution is a bivariate distribution of the random variables
(X, Y), such that the conditional distribution of X, given Y = y, is N(μ, my) and the marginal distribution
of Y is Gamma(a, b), m > 0, with parameters μ, m, a, b. Its pdf is given by

$$ f(x, y) = f(x \mid y)\, f(y) = N(x \mid \mu, my)\; \frac{b^{a}}{\Gamma(a)}\, y^{a-1} e^{-by}, \qquad x \in (-\infty, \infty),\ y > 0 \tag{2.68} $$


Normal-Inverted Gamma Distribution 
Suppose the conditional pdf of X, given Y = y, is a k-variate MVN(θ, yV) and the marginal
pdf of Y is Inverted-Gamma(a, b); then the joint distribution of X and Y is the (multivariate) normal-inverted
gamma distribution and the joint pdf of X and Y is

$$ f(x, y) = f(x \mid y)\, f(y)
 = (2\pi)^{-k/2}\, |yV|^{-1/2} \exp\left[-\frac{1}{2y}(x-\theta)' V^{-1} (x-\theta)\right]
 \frac{b^{a}}{\Gamma(a)}\, y^{-(a+1)} \exp\left(-\frac{b}{y}\right) $$

$$ \propto \exp\left[-\left\{b + \tfrac{1}{2}(x-\theta)' V^{-1} (x-\theta)\right\} \Big/ y\right] \Big/ y^{a+k/2+1} \tag{2.69} $$

where V is a k×k symmetric positive definite matrix.


Remark 2.43. The mean and variance of the marginal distribution of X are

$$ E(X) = E(E(X \mid y)) = E(\theta) = \theta $$

and

$$ Var(X) = E(Var(X \mid y)) + Var(E(X \mid y)) = E(YV) + Var(\theta) = V\,E(Y) = \frac{Vb}{a-1}, \quad \text{provided } a > 1. $$


Remark 2.44. If we let the variances of the elements of X tend to infinity, that is, letting V^{-1} → O,
we have

$$ f(x, y) \propto \exp\left(-\frac{b}{y}\right) \Big/ y^{a+k/2+1} \tag{2.70} $$

Note that here X has an improper distribution. On the other hand, if we let a = b = 0, then

$$ f(x, y) \propto \exp\left[-\frac{1}{2y}(x-\theta)' V^{-1} (x-\theta)\right] \Big/ y^{k/2+1} \tag{2.71} $$

Further, if we let V^{-1} → O in (2.71), we have

$$ f(x, y) \propto y^{-(k/2+1)}. $$


Normal-Wishart Distribution 


It is a multivariate generalization of the normal-gamma density. It is the joint distribution of a mean
vector θ and a precision matrix R such that the conditional distribution of θ when R = r is
MVN(μ, vr), v > 0, and the marginal distribution of R is a Wishart(α, P) distribution, where P is a
symmetric positive definite precision matrix. The joint pdf of (θ, R) is

$$ f(\theta, R) \propto |R|^{(\alpha-k)/2} \exp\left[-\tfrac{1}{2}\left\{ v(\theta-\mu)' R (\theta-\mu) + tr(PR) \right\}\right]. $$

Remark 2.45. The marginal distribution of θ is multivariate t with α df, location parameter μ and scale
parameter P.
Normal-Inverted Wishart Distribution 


It is a generalization of the normal-inverted gamma density. It is the joint pdf of the mean vector
θ and covariance matrix V such that

f(θ | V) is MVN(μ, b^{-1}V), b > 0, V > 0,

and
f(V) is Inverted-Wishart(n, T), n > 2k, T > 0.

The joint distribution of (θ, V) is

$$ f(\theta, V) \propto |V|^{-(n+1)/2} \exp\left[-\tfrac{1}{2}\left\{ tr(V^{-1}T) + b(\theta-\mu)' V^{-1} (\theta-\mu) \right\}\right]. $$


Multivariate t-distribution 


Suppose that the k-dimensional random vector X = (X_1, X_2, ..., X_k)' has a MVN(0, W), W being
a precision matrix, that the random variable Y has a chi-square distribution with n df, and that X
and Y are independent. Suppose also that θ = (θ_1, θ_2, ..., θ_k)' is any given vector in R^k. Consider a
random vector Z = (Z_1, Z_2, ..., Z_k)' defined by the equation

$$ Z_i = X_i \left(\frac{Y}{n}\right)^{-1/2} + \theta_i; \qquad i = 1, 2, \ldots, k. $$

The distribution of Z is called a multivariate t-distribution with n df, location vector θ, and precision
matrix W. The pdf of Z is given by

$$ f(z \mid n, \theta, W) = c \left[1 + \frac{1}{n}(z-\theta)' W (z-\theta)\right]^{-(n+k)/2}, \qquad z \in R^{k} \tag{2.72} $$

$$ \text{where } c = \frac{\Gamma\left(\frac{n+k}{2}\right) |W|^{1/2}}{\Gamma\left(\frac{n}{2}\right) (n\pi)^{k/2}}. $$

The mean and covariance matrix of Z are

$$ E(Z) = \theta \quad \text{and} \quad Cov(Z) = \frac{n}{n-2}\, W^{-1}, \quad n > 2. \tag{2.73} $$


Remark 2.46. If Z is partitioned into sub-vectors Z_1, Z_2 of dimensionality k_1 and k_2 (k_2 = k - k_1) and
θ and W are partitioned accordingly, then the marginal distribution of Z_1 is a k_1-dimensional
multivariate t-distribution with n df, location vector θ_1 and precision matrix (W_11 - W_12 W_22^{-1} W_21), and
the conditional distribution of Z_1, given Z_2, is also a k_1-dimensional multivariate t-distribution with
(n + k_2) df, location vector θ_1 - W_11^{-1} W_12 (Z_2 - θ_2) and precision matrix

$$ \frac{(n+k_2)\, W_{11}}{n + (Z_2-\theta_2)' (W_{22} - W_{21} W_{11}^{-1} W_{12}) (Z_2-\theta_2)}. $$


Remark 2.47. The standardized form of the multivariate Student's t-pdf is obtained by making the
transformation Z - θ = hT such that h'Wh = I. It is given by

$$ f(T \mid n, k) = \frac{n^{n/2}\, \Gamma\left(\frac{n+k}{2}\right)}{\pi^{k/2}\, \Gamma\left(\frac{n}{2}\right)}\, (n + T'T)^{-(n+k)/2}, \qquad T \in R^{k}. \tag{2.74} $$

Here E(T) = 0 and Cov(T) = (n/(n-2)) I_k, n > 2.
Remark 2.48. The multivariate Student's t may also be obtained as a mixture of multivariate normal
and inverted-gamma distributions.


Multinomial Distribution 
Let θ_i (0 < θ_i < 1) be the probability that the outcome of the random experiment belongs to the
ith category (i = 1, 2, ..., k) such that Σ_{i=1}^{k} θ_i = 1. Suppose that the experiment is performed n times and
the n outcomes are independent. Furthermore, let X_i denote the number of those outcomes that belong
to category i (i = 1, 2, ..., k) such that Σ_{i=1}^{k} X_i = n. Then, the random vector X = (X_1, X_2, ..., X_k)' has
a multinomial distribution with parameters n and θ = (θ_1, θ_2, ..., θ_k)' with pmf

$$ f(x \mid n, \theta) = \frac{n!}{x_1!\, x_2! \cdots x_k!}\, \theta_1^{x_1} \theta_2^{x_2} \cdots \theta_k^{x_k}, \qquad \text{if } n = \sum_{i=1}^{k} x_i \tag{2.75} $$

where the x_i's are non-negative integers.
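The mgf of (X_1, X_2, ..., X_{k-1}) is

$$ M(t_1, t_2, \ldots, t_{k-1}) = \left(\theta_1 e^{t_1} + \theta_2 e^{t_2} + \cdots + \theta_{k-1} e^{t_{k-1}} + \theta_k\right)^{n}, \qquad \theta_k = 1 - \sum_{i=1}^{k-1} \theta_i. $$

In particular,

$$ M(t_1, 0, \ldots, 0) = (\theta_1 e^{t_1} + \theta_2 + \cdots + \theta_k)^{n} = (\theta_1 e^{t_1} + 1 - \theta_1)^{n} $$

is the mgf of the Bin(n, θ_1) distribution.
The means, variances, and covariances of the multinomial distribution are

$$ E(X) = n\theta, $$

$$ Var(X_i) = n\theta_i(1-\theta_i); \qquad i = 1, \ldots, k, $$

and

$$ Cov(X_i, X_j) = -n\theta_i\theta_j; \qquad i, j = 1, 2, \ldots, k,\ i \ne j. $$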


Remark 2.49. Since Σ_{i=1}^{k} X_i = n, we can rewrite the multinomial pmf as a (k-1)-dimensional distribution
by eliminating one of the k rvs X_1, X_2, ..., X_k. The joint pmf of X_1, X_2, ..., X_{k-1} is given by

$$ f(x \mid n, \theta) = \frac{n!}{x_1!\, x_2! \cdots x_{k-1}!\, (n - x_1 - \cdots - x_{k-1})!}\;
 \theta_1^{x_1} \cdots \theta_{k-1}^{x_{k-1}}\, (1 - \theta_1 - \cdots - \theta_{k-1})^{\,n - x_1 - \cdots - x_{k-1}}, \qquad \sum_{i=1}^{k-1} x_i \le n \tag{2.76} $$

Remark 2.50. If X ~ Bin(n, θ), then (X, n-X) has a multinomial distribution with parameters n and
(θ, 1-θ).


Remark 2.51. The multinomial distribution reduces to binomial distribution for k = 2. 
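
The multinomial pmf (2.75) and its moments can be verified by brute-force enumeration for a small case. The sketch below is illustrative only; the function name `multinomial_pmf` and the values n = 4, θ = (0.2, 0.3, 0.5) are assumptions.

```python
from itertools import product
from math import comb, factorial, prod

def multinomial_pmf(x, theta):
    """Pmf (2.75): n! / (x_1! ... x_k!) * prod(theta_i ** x_i), with n = sum(x)."""
    coeff = factorial(sum(x))
    for xi in x:
        coeff //= factorial(xi)  # exact integer division at every step
    return coeff * prod(t ** xi for t, xi in zip(theta, x))

# Hypothetical example with k = 3 categories.
n, theta = 4, (0.2, 0.3, 0.5)
support = [x for x in product(range(n + 1), repeat=3) if sum(x) == n]

total = sum(multinomial_pmf(x, theta) for x in support)
mean_1 = sum(x[0] * multinomial_pmf(x, theta) for x in support)
e_x1x2 = sum(x[0] * x[1] * multinomial_pmf(x, theta) for x in support)
cov_12 = e_x1x2 - (n * theta[0]) * (n * theta[1])

# Remark 2.51: for k = 2 the pmf is the binomial pmf.
two_category = multinomial_pmf((1, 3), (0.3, 0.7))
binomial = comb(4, 1) * 0.3 ** 1 * 0.7 ** 3
```

The enumerated covariance equals -nθ_1θ_2, in line with the formula quoted above.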


Dirichlet Distribution 


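A random vector X = (X_1, X_2, ..., X_k)' has a Dirichlet distribution with parameter vector
θ = (θ_1, θ_2, ..., θ_k)', θ_i > 0 (i = 1, 2, ..., k), if it has the pdf

$$ f(x \mid \theta) = \frac{\Gamma\left(\sum_{i=1}^{k} \theta_i\right)}{\prod_{i=1}^{k} \Gamma(\theta_i)} \prod_{i=1}^{k} x_i^{\theta_i - 1}, \qquad x_i > 0,\ i = 1, \ldots, k, \text{ and } \sum_{i=1}^{k} x_i = 1. $$

The means, variances and covariances are

$$ E(X_i) = \theta_i/\theta_0, \quad \text{where } \theta_0 = \sum_{i=1}^{k} \theta_i, $$

$$ Var(X_i) = \theta_i(\theta_0 - \theta_i) / (\theta_0^2(\theta_0 + 1)) $$

$$ \text{and} \quad Cov(X_i, X_j) = -\theta_i\theta_j / (\theta_0^2(\theta_0 + 1)), \quad i \ne j. $$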
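Remark 2.52. If X ~ Beta(a, b), then Y = (X, 1-X)' has a Dirichlet distribution with parameter
vector θ = (a, b)'.

Remark 2.53. The marginal distribution of X_i is Beta(θ_i, Σ_{j=1}^{k} θ_j - θ_i).

Remark 2.54. Since Σ_{i=1}^{k} x_i = 1, the Dirichlet density may be rewritten as

$$ f(x \mid \theta', \theta_k) = \frac{1}{D(\theta', \theta_k)} \prod_{i=1}^{k-1} x_i^{\theta_i - 1} \left(1 - \sum_{i=1}^{k-1} x_i\right)^{\theta_k - 1}, $$

$$ \text{where } D(\theta', \theta_k) = \frac{\Gamma(\theta_1)\,\Gamma(\theta_2) \cdots \Gamma(\theta_k)}{\Gamma(\theta_1 + \theta_2 + \cdots + \theta_k)}, \quad \theta' = (\theta_1, \theta_2, \ldots, \theta_{k-1}),\ \theta_i > 0,\ i = 1, 2, \ldots, k. $$

We shall denote it by Dirichlet(θ', θ_k).

The mode of the marginal distribution of X_i is (θ_i - 1)/(θ_0 - 2), where θ_0 = Σ_{i=1}^{k} θ_i.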


Remark 2.55. In particular, for k = 2, Dirichlet distribution reduces to the beta distribution. 
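
The Dirichlet moment formulas are easy to evaluate directly. The following sketch (the function name `dirichlet_moments` and the parameter values are assumptions) also checks the k = 2 reduction against the familiar Beta(a, b) mean and variance.

```python
def dirichlet_moments(theta):
    """Means, variances and covariances of a Dirichlet(theta) vector."""
    k = len(theta)
    theta0 = sum(theta)
    denom = theta0 ** 2 * (theta0 + 1)
    mean = [t / theta0 for t in theta]
    var = [t * (theta0 - t) / denom for t in theta]
    cov = {(i, j): -theta[i] * theta[j] / denom
           for i in range(k) for j in range(k) if i != j}
    return mean, var, cov

# k = 2 with theta = (a, b): the first coordinate is Beta(a, b).
a, b = 2.0, 3.0
mean2, var2, cov2 = dirichlet_moments((a, b))
beta_mean = a / (a + b)
beta_var = a * b / ((a + b) ** 2 * (a + b + 1))

# k = 3 (hypothetical values): since the X_i sum to one, each row of the
# covariance matrix sums to zero.
mean3, var3, cov3 = dirichlet_moments((1.0, 2.0, 3.0))
row_sum = var3[0] + cov3[(0, 1)] + cov3[(0, 2)]
```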


Dirichlet-Multinomial Distribution 


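This is a multivariate generalization of the beta-binomial distribution. If X has a Multinomial(n, θ)
distribution and θ ~ Dirichlet(μ', μ_k), μ' = (μ_1, ..., μ_{k-1}), then the marginal (compound or mixture)
distribution of X has the pmf

$$ m(x \mid n, \mu) = \int \cdots \int_A f(x \mid n, \theta)\, g(\theta \mid \mu)\, d\theta; \qquad x_i = 0, \ldots, n,\ i = 1, \ldots, k-1, $$

where the integral is over the set A = {(θ_1, ..., θ_{k-1}) : θ_i > 0, i = 1, ..., k-1, Σ_{i=1}^{k-1} θ_i < 1}, so that

$$ m(x \mid n, \mu) \propto \int \cdots \int_A \prod_{i=1}^{k-1} \theta_i^{x_i + \mu_i - 1}
 \left(1 - \sum_{i=1}^{k-1} \theta_i\right)^{n - \sum_{i=1}^{k-1} x_i + \mu_k - 1} d\theta_1\, d\theta_2 \cdots d\theta_{k-1}. $$

Hence,

$$ m(x \mid n, \mu) = \frac{n!}{\prod_{i=1}^{k-1} x_i!\, \left(n - \sum_{i=1}^{k-1} x_i\right)!}\;
 \frac{D\left(\mu' + x',\ \mu_k + n - \sum_{i=1}^{k-1} x_i\right)}{D(\mu', \mu_k)} \tag{2.77} $$

which is Di-Mu(n, μ', μ_k).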


Distribution for Correlation Coefficient 


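Suppose X_1 and X_2 are two rvs whose joint distribution is a BVN with precision matrix r, where
the elements of the 2×2 matrix r are defined as

$$ r = \begin{pmatrix} r_{11} & r_{12} \\ r_{12} & r_{22} \end{pmatrix}, \qquad
 r^{-1} = \frac{1}{|r|} \begin{pmatrix} r_{22} & -r_{12} \\ -r_{12} & r_{11} \end{pmatrix}. $$

Therefore, the correlation ρ of X_1 and X_2 is

$$ \rho = \frac{-r_{12}/|r|}{(r_{11} r_{22})^{1/2}/|r|} = -\frac{r_{12}}{(r_{11} r_{22})^{1/2}}. $$

We know that the distribution of the precision matrix r is a Wishart(α, τ) distribution such that α > 1
and τ is a symmetric positive definite precision matrix.

Let us write

$$ \tau = \begin{pmatrix} \tau_{11} & \tau_{12} \\ \tau_{12} & \tau_{22} \end{pmatrix}. $$

The joint pdf f(r_11, r_12, r_22) is

$$ f(r_{11}, r_{12}, r_{22}) = c\, |\tau|^{\alpha/2}\, |r|^{(\alpha-3)/2} \exp\left[-\tfrac{1}{2}\, tr(\tau r)\right] $$

$$ = c\, |\tau|^{\alpha/2}\, (r_{11} r_{22} - r_{12}^2)^{(\alpha-3)/2}
 \exp\left[-\tfrac{1}{2}\left(\tau_{11} r_{11} + \tau_{22} r_{22} + 2\tau_{12} r_{12}\right)\right], $$

$$ \text{where } c^{-1} = 2^{\alpha}\, \pi^{1/2} \prod_{j=1}^{2} \Gamma\left(\frac{\alpha+1-j}{2}\right) \quad (\text{since } n = \alpha \text{ and } k = 2), $$

provided r_11 > 0, r_22 > 0 and r_11 r_22 - r_12^2 > 0.

In order to obtain the distribution of the correlation coefficient ρ, let us define
y = τ_22 r_22 / (τ_11 r_11). Then

$$ r_{22} = \frac{\tau_{11}}{\tau_{22}}\, r_{11}\, y \quad \text{and} \quad
 r_{12} = -\left(\frac{\tau_{11}}{\tau_{22}}\right)^{1/2} r_{11}\, y^{1/2} \rho, $$

and we have

$$ g(r_{11}, y, \rho) = f(r_{11}, r_{12}, r_{22})\, |J|. $$

Since

$$ J = \frac{\partial(r_{11}, r_{12}, r_{22})}{\partial(r_{11}, y, \rho)}, \qquad
 |J| = \left(\frac{\tau_{11}}{\tau_{22}}\right)^{3/2} r_{11}^{2}\, y^{1/2}, $$

we obtain

$$ g(r_{11}, y, \rho) = c \left(\frac{\tau_{11}}{\tau_{22}}\right)^{\alpha/2} |\tau|^{\alpha/2}\,
 r_{11}^{\alpha-1}\, y^{(\alpha-2)/2}\, (1-\rho^2)^{(\alpha-3)/2}
 \exp\left[-\frac{r_{11}}{2}\left\{\tau_{11}(1+y) - 2\tau_{12}\left(\frac{\tau_{11}}{\tau_{22}}\right)^{1/2} y^{1/2} \rho\right\}\right]. $$

On integrating with respect to r_11, we have

$$ g(y, \rho) = \int_0^\infty g(r_{11}, y, \rho)\, dr_{11}
 = c \left(\frac{\tau_{11}}{\tau_{22}}\right)^{\alpha/2} |\tau|^{\alpha/2}\, y^{(\alpha-2)/2}\, (1-\rho^2)^{(\alpha-3)/2}\;
 \frac{2^{\alpha}\, \Gamma(\alpha)}{\left\{\tau_{11}(1+y) - 2\tau_{12}\left(\frac{\tau_{11}}{\tau_{22}}\right)^{1/2} y^{1/2} \rho\right\}^{\alpha}} $$

$$ = \frac{c^{*}\, (1-\rho^2)^{(\alpha-3)/2}}{y \left(y^{1/2} + y^{-1/2} - 2\tau_{12} (\tau_{11}\tau_{22})^{-1/2} \rho\right)^{\alpha}}, \tag{2.78} $$

$$ \text{where } c^{*} = c\, 2^{\alpha} \left(\frac{|\tau|}{\tau_{11}\tau_{22}}\right)^{\alpha/2} \Gamma(\alpha). $$

We may obtain the marginal pdf of ρ by integrating out y from g(y, ρ).
In particular, if τ_12 = 0, we may obtain

$$ g(\rho) = \int_0^\infty \frac{c^{*}\, (1-\rho^2)^{(\alpha-3)/2}}{y\, (y^{1/2} + y^{-1/2})^{\alpha}}\, dy \tag{2.79} $$

which can be expressed as g(ρ) = c**(1-ρ^2)^{(α-3)/2}, |ρ| < 1, where c** is the normalising constant.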


2.5 FAMILIES OF DISTRIBUTIONS 
Exponential family of distributions 


A family of distributions on the real line with probability density function (or probability
mass function) f(x | θ), θ ∈ Θ ⊂ R, is said to be a one-parameter exponential family of distributions
if

$$ f(x \mid \theta) = v(\theta)\, u(x) \exp[c\,\phi(\theta)\, h(x)] \tag{2.80} $$

The function v(θ) is a normalizing constant and is determined by the functions u(x), φ(θ), and h(x); we
have

$$ (v(\theta))^{-1} = \begin{cases}
 \sum_{x \in S} u(x) \exp(c\,\phi(\theta)\, h(x)) & \text{if } x \text{ is discrete} \\[2mm]
 \int_{x \in S} u(x) \exp(c\,\phi(\theta)\, h(x))\, dx & \text{if } x \text{ is continuous.} \end{cases} $$


Remark 2.56. The family is called regular if the sample space S does not depend on θ; otherwise it
is called non-regular. For example, the U(0, θ) distribution is a member of the non-regular family.
Example 2.1. Let X ~ Bernoulli(θ), that is,
f(x | θ) = θ^x (1-θ)^{1-x}; x = 0, 1; θ ∈ (0, 1)
= exp(x log θ + (1-x) log(1-θ))
= exp(x log(θ/(1-θ)) + log(1-θ)).
Since u(x) = 1, v(θ) = (1-θ), c = 1, φ(θ) = log(θ/(1-θ)) and h(x) = x, it belongs to the regular one-parameter
exponential family of distributions.
Example 2.2. Let X ~ N(0, θ), where θ is the precision; then

$$ f(x \mid \theta) = \sqrt{\frac{\theta}{2\pi}}\, \exp\left(-\frac{\theta}{2} x^2\right). $$

Here u(x) = 1/√(2π), v(θ) = θ^{1/2}, c = -1/2, φ(θ) = θ and h(x) = x^2, and the range of x is independent of θ.
Therefore, it also belongs to the regular one-parameter exponential family of distributions.
Example 2.3. Let X ~ U(0, θ), f(x | θ) = 1/θ, x ∈ (0, θ), θ > 0. Hence u(x) = 1, v(θ) = 1/θ, c = 1, φ(θ) = 0
and h(x) = 0.

However, the range of X depends on the parameter θ; therefore, it belongs to the non-regular
one-parameter exponential family of distributions.
Remark 2.57. Exponential family of distributions is also known as Koopman-Pitman-Darmois family of 
distributions. 
Remark 2.58. If X_1, X_2, ..., X_n is a random sample from a regular one-parameter exponential family of
distributions, then T = Σ_{i=1}^{n} h(X_i) is a sufficient statistic for θ. The sufficient statistic T has pdf

$$ f(t \mid \theta) = v_n(\theta)\, u_n(t) \exp[c\,\phi(\theta)\, t]. \tag{2.81} $$

Remark 2.59. A convenient form, known as the canonical form, is obtained by replacing cφ(θ) by η, so
that

$$ f(x \mid \eta) = u(x)\, \beta(\eta) \exp[\eta\, h(x)]. \tag{2.82} $$

The moment generating function of h(X) is

$$ M(z) = E(\exp(z\, h(X))) = \beta(\eta) / \beta(\eta + z). $$
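Example 2.4. Let X ~ Bin(m, θ), m known; then f(x | θ) is a member of the one-parameter exponential family
with

$$ v(\theta) = (1-\theta)^{m}, \quad u(x) = \binom{m}{x}, \quad \phi(\theta) = \log\frac{\theta}{1-\theta}, \quad h(x) = x, \quad c = 1. $$

Putting

$$ \eta = \log\left(\frac{\theta}{1-\theta}\right), \qquad \beta(\eta) = \left(\frac{1}{1+e^{\eta}}\right)^{m}, $$

and, therefore, the moment generating function of h(x) = x is

$$ M(z) = \left(\frac{1+e^{\eta+z}}{1+e^{\eta}}\right)^{m} = (1 - \theta + \theta e^{z})^{m}. $$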
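
The identity M(z) = β(η)/β(η + z) in Example 2.4 can be confirmed numerically against both the closed form and the definition of the mgf. This is a minimal sketch; the values m = 5, θ = 0.3, z = 0.7 are arbitrary assumptions.

```python
from math import comb, exp, log

# Hypothetical values for the check.
m, theta, z = 5, 0.3, 0.7
eta = log(theta / (1 - theta))  # canonical parameter of Bin(m, theta)

def beta_fn(eta_value):
    """beta(eta) = (1 + e^eta)^(-m), the normalizer in the canonical form."""
    return (1 + exp(eta_value)) ** (-m)

# Three routes to the mgf of h(X) = X at the point z.
mgf_canonical = beta_fn(eta) / beta_fn(eta + z)
mgf_closed = (1 - theta + theta * exp(z)) ** m
mgf_direct = sum(exp(z * x) * comb(m, x) * theta ** x * (1 - theta) ** (m - x)
                 for x in range(m + 1))
```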


Remark 2.60. Some of the well-known members of this family of distributions are Poisson, binomial 
(with n known), normal, gamma, etc. However, Cauchy distribution cannot be expressed in the form of 
equation (2.82) and, therefore, it is not a member of exponential family of distributions. 
Remark 2.61. E(h(X)) = -β'(η) / β(η).
Remark 2.62. The form of the pdf given in (2.82) is not unique. We can, for example, multiply φ(θ) by
a constant c if at the same time h is replaced by h/c. Some authors prefer to write f(x | θ) as

$$ f(x \mid \theta) = u(x) \exp[\phi(\theta)\, h(x) - A(\theta)]. \tag{2.83} $$

Remark 2.63. The definition of a regular one-parameter exponential family of distributions may be
extended to the regular k-parameter case (k ≥ 2), having a pdf (or pmf)

$$ f(x \mid \theta) = u(x)\, v(\theta) \exp\left[\sum_{i=1}^{k} c_i\, \phi_i(\theta)\, h_i(x)\right] \tag{2.84} $$

with canonical form

$$ f(x \mid \eta) = u(x)\, \beta(\eta) \exp\left[\sum_{i=1}^{k} \eta_i\, h_i(x)\right] \tag{2.85} $$

The moment generating function is

$$ M(z_1, z_2, \ldots, z_k) = E\left(e^{\sum_{i=1}^{k} z_i h_i(X)}\right) = \beta(\eta) / \beta(\eta + z) \tag{2.86} $$

where η = (η_1, η_2, ..., η_k) and z = (z_1, z_2, ..., z_k).

Remark 2.64. The N(μ, σ^2) is a two-parameter regular exponential family whereas the Bivariate-Normal
(μ_1, μ_2, σ_1, σ_2, ρ) is a five-parameter regular exponential family.


Modified Power Series Distribution (MPSD) 


A discrete random variable X is said to have MPSD if its pmf is given by

$$ f(x \mid \theta) = \frac{a(x)\, (g(\theta))^{x}}{f(\theta)}, \qquad x \in S, \tag{2.87} $$

where S is a subset of the set of non-negative integers, a(x) > 0; g(θ) and f(θ) are positive, finite, and
differentiable functions. This class of distributions includes, among others, the binomial, the Poisson,
the logarithmic series, the negative binomial, the generalised Poisson and power series distributions.
The mean of MPSD is

$$ \mu = E(X) = \frac{g(\theta)\, f'(\theta)}{f(\theta)\, g'(\theta)} $$

and the variance is

$$ \sigma^2 = \frac{g(\theta)}{g'(\theta)}\, \frac{d\mu}{d\theta}. $$

Remark 2.65. The MPSD can be expressed as a member of the exponential family of distributions since
f(x | θ) = a(x) c(θ) exp(x log g(θ)), where c(θ) = (f(θ))^{-1}. (2.88)
Power Series Distributions


The family of distributions having pmf

$$ f(x \mid \theta) = \frac{a(x)\, \theta^{x}}{c(\theta)}; \qquad x = 0, 1, 2, \ldots;\ \theta > 0 \tag{2.89} $$

where c(θ) = Σ_x a(x) θ^x and the function a(x) is such that Σ_x a(x) θ^x < ∞ for some θ > 0. The Bin(n, p)
distribution is its member for

$$ a(x) = \binom{n}{x}, \qquad \theta = \frac{p}{1-p}, \qquad c(\theta) = (1+\theta)^{n}. $$

The other well-known members of this class are the negative binomial and Poisson distributions.
The moment generating function is

$$ M(t) = \frac{c(\theta e^{t})}{c(\theta)} $$

and the mean is

$$ E(X) = \frac{\theta\, c'(\theta)}{c(\theta)}. $$

Remark 2.66. The family of Power Series Distributions is a subfamily of the exponential family of
distributions since f(x | θ) = a(x) c_1(θ) exp(x log θ), where c_1(θ) = (c(θ))^{-1}, and it is a member of the
MPSD class for g(θ) = θ.
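
A short numerical check of the power series mean θc'(θ)/c(θ) for two members of the family follows. It is a sketch under stated assumptions: the helper `psd_mean`, the parameter values, and the truncation of the Poisson support at 100 terms are all illustrative choices.

```python
from math import comb, factorial

def psd_mean(a, theta, support):
    """Mean of a power series distribution, summed term by term over `support`."""
    c = sum(a(x) * theta ** x for x in support)  # c(theta) = sum a(x) theta^x
    return sum(x * a(x) * theta ** x for x in support) / c

# Binomial member: a(x) = C(n, x), theta = p / (1 - p), c(theta) = (1 + theta)^n.
n, p = 6, 0.25
theta_b = p / (1 - p)
mean_binom = psd_mean(lambda x: comb(n, x), theta_b, range(n + 1))
formula_binom = theta_b * n * (1 + theta_b) ** (n - 1) / (1 + theta_b) ** n

# Poisson member: a(x) = 1 / x!, c(theta) = e^theta, so theta c'(theta)/c(theta) = theta.
lam = 2.5
mean_pois = psd_mean(lambda x: 1 / factorial(x), lam, range(100))
```

Both enumerated means agree with the formula: np for the binomial member and θ itself for the Poisson member.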


Chapter 3 


Bayes Theorem 


3.1 INTRODUCTION 


Uncertainty plays an important role in our lives. A satisfactory description of uncertainty is by means 
of probability. Probability theory provides a powerful tool for understanding, manipulating, and 
controlling this important feature of our appreciation of our environment. Bayesian approach to 
statistical inference exploits the simple idea that the only satisfactory description of uncertainty is by 
means of probability. 

The probabilistic modelling incorporates the available information about the phenomenon and the 
uncertainty pertaining to this information. It allows a quantitative discussion on the problem by 
providing via probability theory a genuine calculus of uncertainty going beyond a mere description 
of deterministic modelling. This is why a probabilistic interpretation is necessary for statistical inference. 

Bayes theorem is an essential element of the Bayesian approach to statistical inference. The 
central feature of Bayesian inference is the direct quantification of uncertainty in terms of probabilistic 
statements. The rules of probability may be invoked to calculate the relevant probabilities of the desired 
statements. The rules by which probabilities cohere are (a) convexity, that is lying in the interval 
[0, 1] with zero for an impossible event, (b) addition, and (c) multiplication laws. The others are derived 
from these basic laws. 

Bayes theorem is also referred to in the literature as the “Principle of inverse probability.” In 
problems of inverse probability we wish to infer what probability model generated the data from the 
information in the data. On the other hand, in the problems of “direct probability” we know the 
probability model including values of its parameters, and our interest is in making probability statements
about the outcomes or data produced by the known probability model. 

Augustus de Morgan (1838) mentioned that the probability questions are of two different types, 
namely, 

(a) where we know the previous circumstances and require the probability of an event. 

(b) where we know the event which has happened and require the probability which results from 
any particular set of circumstances under which it might have happened. 

Thus, the first is called direct and the second inverse. Development of inverse probability culminated
with the unconventional use of probability models to validate inference derived from them and then,
eventually, to extend the domain of application for statistical inference.


3.2 BAYES THEOREM FOR EVENTS 


The probability of an event A depends upon the available information about the event A. For
example, if a die has two faces marked with the number 6 and A is the event that a number
other than 6 appears on the die, then P(A) = 2/3 and not 5/6 (the value for a fair die
with distinct numbers on its faces). To represent the prior information that the die
has two faces marked with the number 6, we denote this event by B and should write
P(A|B) instead of P(A).

Bayes theorem is the basic rule for incorporating the prior information that the event B has
occurred into the evaluation of the probability of the event A. The simplest form of Bayes
theorem,

$$ P(A \mid B) = \frac{P(A)\, P(B \mid A)}{P(B)}, \qquad P(B) > 0, \tag{3.1} $$

follows easily from the definition of conditional probability:
P(A)P(B|A) = P(AB) = P(B)P(A|B).

It provides a mechanism for the process of learning by experience. The connection between P(A|B)
and P(B|A) together with the initial probability P(A) is the basis for the process of acquiring knowledge.
In general, given two events A and B, the inductive reasoning consists in applying Bayes theorem,
which answers how the information about the occurrence of event B influences P(A). The posterior
probability P(A|B) is proportional to the initial (prior) probability P(A) and the so-called likelihood
P(B|A). This is the process by which we learn from experience in the sense that experience gives us
information that can modify our initial belief according to the factor P(B|A)/P(B).

Remark 3.1. P(A|B) > P(A) if, and only if, P(A|B) > P(A|B')
(provided P(B') > 0; otherwise B is a certain event and, therefore, its probability would not be of
interest).
Proof. Using P(B') = 1 - P(B) and the law of total probability

P(A) = P(A|B)P(B) + P(A|B')P(B'),
we have

P(A|B) - P(A) = P(B')(P(A|B) - P(A|B')).
Remark 3.2. If A and B are events such that P(B) ≠ 0, then P(A|B) and P(B|A) are related by
P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|A')P(A')] (3.2)

It is so because AB and A'B are mutually exclusive events and AB ∪ A'B = B. Thus

P(B) = P(AB) + P(A'B)

= P(B|A)P(A) + P(B|A')P(A')

Remark 3.3. For events A, B and C,

$$ \frac{P(A \mid C)}{P(B \mid C)} = \frac{P(C \mid A)}{P(C \mid B)} \qquad \text{when } P(A) = P(B). $$

Thus, for two equiprobable events A and B, the ratio of their conditional probabilities, given an event
C has happened, is the same as the ratio of the conditional probabilities of the event C given the two
events.
Example 3.1. An urn contains two coins, one is fair and the other is two-headed. A coin is selected
at random and tossed. You are allowed to see the up face, which is heads. What is the probability that
the hidden face is also heads?
Solution. Let A = Two-headed coin is selected.

B = Head turns up when coin is tossed.
Note that, if the selected coin is two-headed, event B is sure to occur, but if the coin is fair, event
B occurs with probability 1/2. Thus

P(B|A) = 1, P(B|A') = 1/2

We have

P(B) = P(A)P(B|A) + P(A')P(B|A') = 3/4
Bayes theorem (3.2) gives

P(A|B) = P(A)P(B|A)/P(B) = 2/3,

since P(A) = P(A') = 1/2.
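
The arithmetic of Example 3.1 can be reproduced with exact fractions. This is only an illustrative sketch; the helper name `posterior` is an assumption, not notation from the text.

```python
from fractions import Fraction

def posterior(prior_a, lik_given_a, lik_given_not_a):
    """P(A|B) from (3.2): P(B|A)P(A) / [P(B|A)P(A) + P(B|A')P(A')]."""
    numerator = lik_given_a * prior_a
    return numerator / (numerator + lik_given_not_a * (1 - prior_a))

# A = two-headed coin selected, B = head turns up.
p_b = Fraction(1, 2) * 1 + Fraction(1, 2) * Fraction(1, 2)             # P(B)
p_two_headed = posterior(Fraction(1, 2), Fraction(1), Fraction(1, 2))  # P(A|B)
```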
Example 3.2. Anuj tells the truth with probability p_1, and Brij with probability p_2.
(i) If they make the same statement, what is the probability that the statement is true?
(ii) If Anuj makes the statement and Brij says that Anuj is telling a lie, what is the probability that
Anuj told the truth?
Solution. Let T = Statement is true

F = Statement is false

C = Anuj and Brij make the same statement

D = Anuj makes a statement and Brij denies it

Since P(T) = P(F) = 1/2, P(C|T) = p_1 p_2, P(C|F) = (1-p_1)(1-p_2), P(D|T) = p_1(1-p_2), P(D|F) = p_2(1-p_1),
the Bayes rule gives
(i) P(T|C) = P(T)P(C|T) / [P(T)P(C|T) + P(F)P(C|F)]

$$ = \frac{p_1 p_2 / 2}{[p_1 p_2 + (1-p_1)(1-p_2)]/2} = \frac{p_1 p_2}{1 + 2 p_1 p_2 - (p_1 + p_2)} $$

(ii) If the event A_1 is that Anuj tells the truth, then
P(A_1|D) = P(A_1)P(D|A_1) / P(D) = p_1(1-p_2) / [p_1(1-p_2) + p_2(1-p_1)]

Example 3.3. The Estrogen-receptor and Progesterone-receptor statuses of tumors from 20 patients with
locally advanced breast cancer were assessed, with the following proportions:

                          Progesterone
                      Positive    Negative
Estrogen  Positive      8/20        4/20
          Negative      1/20        7/20

Let A be the event that a tumor is Estrogen-receptor positive and B be the event that it is
Progesterone-receptor positive. Then

P(A) = P(AB) + P(AB') = 8/20 + 4/20 = 0.6

P(B) = P(BA) + P(BA') = 8/20 + 1/20 = 0.45

and P(AB) = 8/20 = 0.4

The conditional probability that a tumor is Progesterone-receptor positive, given that it is
Estrogen-receptor positive, is

P(B|A) = P(AB) / P(A) = 0.4 / 0.6 = 0.67

whereas the conditional probability that a tumor is Progesterone-receptor positive, given that it is
Estrogen-receptor negative, is

P(B|A') = P(A'B) / P(A') = (1/20) / (8/20) = 1/8 = 0.125

Thus, the probability that a tumor is Progesterone-receptor positive is
P(B) = P(B|A)P(A) + P(B|A')P(A') = 9/20 = 0.45.
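
The table computations of Example 3.3 can be replayed with exact fractions. The dictionary layout and the key labels ("E+", "P+", and so on) are illustrative assumptions.

```python
from fractions import Fraction as F

# Joint proportions: (estrogen status, progesterone status).
joint = {
    ("E+", "P+"): F(8, 20), ("E+", "P-"): F(4, 20),
    ("E-", "P+"): F(1, 20), ("E-", "P-"): F(7, 20),
}

p_a = joint[("E+", "P+")] + joint[("E+", "P-")]       # P(A): estrogen positive
p_b = joint[("E+", "P+")] + joint[("E-", "P+")]       # P(B): progesterone positive
p_b_given_a = joint[("E+", "P+")] / p_a               # P(B|A)
p_b_given_not_a = joint[("E-", "P+")] / (1 - p_a)     # P(B|A')

# Law of total probability recovers P(B).
p_b_total = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
```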
Example 3.4. In answering a question on a multiple choice test, an examinee either knows the answer
with probability p or guesses with probability q = 1-p. The probability of answering the question
correctly is 1 for an examinee who knows the answer and 1/m for one who guesses (m being the number
of multiple choice alternatives). Suppose an examinee answers a question correctly; what is the
probability that he really knows the answer?
Solution. Let us denote the events
K = Examinee knows the answer
G = Examinee guesses the answer
C = Examinee answers correctly
Since P(C|K) = 1 and P(C|G) = 1/m,
then P(K|C) = P(C|K)P(K) / [P(K)P(C|K) + P(G)P(C|G)]

$$ = \frac{p}{p + q/m} = \frac{mp}{q + mp}. $$
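
The closed form of Example 3.4 is simple to tabulate. In this sketch the function name and the values p = 1/2, m = 4 are assumptions chosen for illustration.

```python
from fractions import Fraction

def prob_knows_given_correct(p, m):
    """P(K|C) = m p / (q + m p), with q = 1 - p and m answer alternatives."""
    q = 1 - p
    return m * p / (q + m * p)

# Hypothetical case: the examinee knows half the answers, four alternatives.
four_choices = prob_knows_given_correct(Fraction(1, 2), 4)

# With a single alternative a correct answer carries no information,
# so the posterior equals the prior p.
one_choice = prob_knows_given_correct(Fraction(1, 2), 1)
```

A correct answer therefore raises the probability of knowledge from 1/2 to 4/5 when there are four alternatives.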


3.3. BAYES FACTOR 


The version of Bayes theorem in Remark 3.2 may be rewritten as

$$ P(A \mid B) = \left[1 + \frac{P(A')\, P(B \mid A')}{P(A)\, P(B \mid A)}\right]^{-1} \tag{3.3} $$

and

$$ P(A' \mid B) = 1 - P(A \mid B) = \frac{P(A')\, P(B \mid A')}{P(A)\, P(B \mid A)} \Big/ \left[1 + \frac{P(A')\, P(B \mid A')}{P(A)\, P(B \mid A)}\right]. $$

Thus

$$ \frac{P(A \mid B)}{P(A' \mid B)} = \frac{P(A)}{P(A')} \cdot \frac{P(B \mid A)}{P(B \mid A')} \tag{3.4} $$

The ratio P(A|B)/P(A'|B) is known as the posterior odds in favour of event A and the factor P(A)/P(A')
is known as the prior odds in favour of event A. The ratio of the posterior odds to the prior odds in favour
of event A is P(B|A)/P(B|A'), which is known as the Bayes factor in favour of event A. Thus the Bayes factor is
a ratio of conditional probabilities of the available information. If we define the likelihood of a model as
the probability of the observations assuming that the model is true, then the Bayes factor may be
considered as the ratio of likelihoods.
Example 3.5. A burglar cuts himself on a broken window in the process of committing a crime. DNA
analysis reveals that only one person in a hundred thousand would match the culprit's blood. A man is
charged with committing the crime and DNA analysis of his blood finds that it indeed matches that on the
glass. Find the Bayes factor in favour of guilt.
Solution. Let B be the event that the man's blood matches that on the glass and G be the event that the
man is guilty. We have

P(B|G) = 1 and P(B|G') = 1/100,000.

Thus the Bayes factor in favour of guilt is

P(B|G) / P(B|G') = 100,000.
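
The Bayes factor only rescales the odds, so the posterior probability of guilt still depends on the prior. The sketch below illustrates this with a purely hypothetical prior of one in a million, which is not given in the example; the function name is also an assumption.

```python
from fractions import Fraction

def posterior_probability(prior, bayes_factor):
    """Posterior odds = prior odds * Bayes factor, then odds -> probability."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

bayes_factor = Fraction(1) / Fraction(1, 100_000)  # P(B|G) / P(B|G')
prior_guilt = Fraction(1, 1_000_000)               # hypothetical prior (assumption)
post_guilt = posterior_probability(prior_guilt, bayes_factor)
```

Under this assumed prior the posterior probability of guilt is 100000/1099999, roughly 0.09, even though the Bayes factor is 100,000.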
P(B|G) 


3.4 GENERALIZED BAYES THEOREM FOR EVENTS 


Let A_1, A_2, ..., be an infinite sequence of disjoint events with ∪_{i=1}^{∞} A_i = S and P(A_i) > 0 for
i = 1, 2, .... Suppose B is any other event such that P(B) > 0. Then

$$ P(A_i \mid B) = P(B \mid A_i)\, P(A_i) \Big/ \sum_{j=1}^{\infty} P(B \mid A_j)\, P(A_j), \qquad i = 1, 2, \ldots \tag{3.5} $$

A similar result holds for a finite sequence of disjoint events A_1, A_2, ..., A_n satisfying the above
conditions.

Proof: For any fixed value i (i = 1, 2, ...)
P(A_i|B) = P(BA_i)/P(B)
= P(B|A_i)P(A_i)/P(B).

However, B = ∪_{j=1}^{∞} BA_j and the events BA_1, BA_2, ... are disjoint, so we have

$$ P(B) = \sum_{j=1}^{\infty} P(BA_j) = \sum_{j=1}^{\infty} P(A_j)\, P(B \mid A_j). $$

Hence the result.
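
For a finite partition, formula (3.5) amounts to normalizing the products of priors and likelihoods. A minimal sketch follows; the function name `bayes_posterior` and the three-hypothesis inputs are illustrative assumptions.

```python
from fractions import Fraction

def bayes_posterior(priors, likelihoods):
    """Return P(A_i|B) for a finite partition A_1, ..., A_n, as in (3.5)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(A_i) P(B|A_i)
    total = sum(joint)                                    # P(B), law of total probability
    if total == 0:
        raise ValueError("P(B) must be positive")
    return [j / total for j in joint]

# Three equally likely hypotheses with likelihoods 1, 2/3 and 1/3.
priors = [Fraction(1, 3)] * 3
likelihoods = [Fraction(1), Fraction(2, 3), Fraction(1, 3)]
post = bayes_posterior(priors, likelihoods)
```

Using exact fractions avoids rounding, and the returned posteriors always sum to one.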
Remark 3.4. William Burnside (1924), following Poincare (1912), stated the formula

$$ P(A_i \mid B) = P(B \mid A_i)\, P(A_i) \Big/ \sum_{j=1}^{n} P(B \mid A_j)\, P(A_j), \qquad i = 1, 2, \ldots, n \tag{3.6} $$

as Bayes' formula. Karl Pearson (1924) noted that this is only an element in Bayes' 'Essay' (1764).
Remark 3.5. Bayes theorem is only a statement of conditional probability and, therefore, there is no
question of its validity. The real question is its applicability to general problems of scientific inference
since prior probabilities are hard to assess.
Remark 3.6. Rev. Thomas Bayes defined probability as an expectation of the future pay-off of a unit
bet rather than as a limiting frequency and proved that if E_1 and E_2 are two events, then

P(E_2|E_1) = P(E_1E_2)/P(E_1)
and P(E_1|E_2) = P(E_1E_2)/P(E_2)
in Section 1 of the "Essay". Bayes introduced the order in which we learn about the happening of the
event. He solved the following problem:

Given: the number of times in which an event has happened and failed. Required: the chance
that the probability of its happening in a single trial lies somewhere between any two degrees of
probability that can be named.

In other words, he considered the problem:

Event M has happened p times and failed to happen q times. For any a and b, what is
P(a < P(M) < b | p, q)?

Bayes used Newton's geometric approach to show that it is

$$ \int_a^b x^{p}(1-x)^{q}\, dx \Big/ \int_0^1 x^{p}(1-x)^{q}\, dx, $$

under the assumption that all values of the unknown probability are equally likely before the
observations are made.

Example 3.6. Given three urns with white and black balls. The probabilities of drawing a white ball out
of these urns are p_1 = 1, p_2 = 2/3, and p_3 = 1/3, respectively. The urns are equally probable. We have
chosen an urn and drawn a ball from it. It is white. The probability that the ball belongs to the first
urn is

$$ P[\text{urn 1} \mid \text{Ball is white}] = \frac{P(\text{Ball is white} \mid \text{urn 1})\, P(\text{urn 1})}{\sum_{i=1}^{3} P[\text{Ball is white} \mid \text{urn } i]\, P(\text{urn } i)}
 = \frac{1 \times \frac{1}{3}}{\frac{1}{3}\left(1 + \frac{2}{3} + \frac{1}{3}\right)} = \frac{1}{2}. $$


Example 3.7. Suppose you meet a stranger in a bar who offers to toss a coin to decide who shall pay 
for the drinks. He will pay if the coin falls tails, you will pay if it falls heads. The thought crosses your 
mind that the coin may be two-headed, but you are too delicate to ask to examine it. What is the 
probability of the coin being two-headed? 

Solution. Suppose the a priori probability of the two-headed coin is p. Thus the probability of “the fair 
coin” is (1 − p). Suppose the stranger tosses the coin and it falls heads. The Bayes theorem allows you 
to modify your probability in the following way: 

\[ P[\text{Coin is two-headed} \mid \text{Heads}] = \frac{P(\text{Heads} \mid \text{Coin is two-headed})\, P(\text{Coin is two-headed})}{P(\text{Heads} \mid \text{Coin is two-headed})\, P(\text{Coin is two-headed}) + P(\text{Heads} \mid \text{Coin is fair})\, P(\text{Coin is fair})} \]
\[ = \frac{p}{p + (1-p)/2} = 2p/(1+p). \]

Thus if you are certain, before the coin is tossed, that the coin is two-headed (p = 1) or that the “coin 
is fair” (p = 0), the outcome of the toss will not change your probability. However, if p = 1/2, then the 
posterior probability of the coin being two-headed, given that heads appeared, increases to 2/3. 
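
The update in Example 3.7 can be checked with exact rational arithmetic. This is a small sketch; the function name `posterior_two_headed` is a hypothetical helper introduced only for illustration.

```python
from fractions import Fraction


def posterior_two_headed(p):
    """Posterior probability of a two-headed coin after one toss shows heads.

    p is the prior probability that the coin is two-headed; the fair coin
    is assumed to show heads with probability 1/2.
    """
    p = Fraction(p)
    like_two_headed = Fraction(1)       # P(Heads | two-headed)
    like_fair = Fraction(1, 2)          # P(Heads | fair)
    evidence = like_two_headed * p + like_fair * (1 - p)
    return like_two_headed * p / evidence


half_prior = posterior_two_headed(Fraction(1, 2))   # equals 2p/(1+p) at p = 1/2
```

The extreme priors p = 0 and p = 1 are left unchanged by the data, as noted in the text.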

Example 3.8. Ram, Shyam and Mohan are in a jail and are suspected to have killed an old couple. The 
authorities have decided to hang two of them but it is not known which two. Ram decides that he has 
a 2/3 chance of being hanged since there are three possible pairs of persons to be hanged: Ram and 
Shyam (RS), Shyam and Mohan (SM), and Mohan and Ram (RM). If the probability of each pair is 
1/3, then the event that Ram gets hanged is the union RS ∪ RM. Ram asks the jail warden for more 
information. The warden tells Ram which one of the other two people is to be hanged. Suppose that, 
when the pair in the warden's mind is Shyam and Mohan, he names Shyam with probability 1/2. 

Let event D be that the jail warden says Shyam. Then 

\[ P(RS \mid D) = \frac{P(D \mid RS) P(RS)}{P(D \mid RS) P(RS) + P(D \mid RM) P(RM) + P(D \mid SM) P(SM)} = \frac{1(1/3)}{1(1/3) + 0(1/3) + (1/2)(1/3)} = \frac{2}{3}, \]
\[ P(RM \mid D) = \frac{0(1/3)}{1(1/3) + 0(1/3) + (1/2)(1/3)} = 0, \]
\[ P(SM \mid D) = \frac{(1/2)(1/3)}{1(1/3) + 0(1/3) + (1/2)(1/3)} = \frac{1}{3}. \]

Therefore, the probability that Ram is in the list of persons to be hanged, P(RS | D) + P(RM | D), is still 2/3. 

Remark 3.7. If the jail warden names Shyam with probability one whenever the pair is Shyam and 
Mohan, that is, P(D | SM) = 1, then the posterior probability of Ram being hanged, P(RS | D), drops 
to 1/2, since 

\[ P(RS \mid D) = \frac{1(1/3)}{1(1/3) + 0(1/3) + 1(1/3)} = \frac{1}{2}. \]

On the other hand, if P(D | SM) = 0, which happens when the jail warden always says Mohan 
when the pair is Shyam and Mohan, then the probability of Ram being hanged increases to 1, 
since 

\[ P(RS \mid D) = \frac{1(1/3)}{1(1/3) + 0(1/3) + 0(1/3)} = 1. \]
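
The dependence of Ram's posterior probability on the warden's behaviour can be tabulated directly. The sketch below assumes, as in Example 3.8, equal prior probabilities for the three pairs; the parameter `q` (the probability that the warden names Shyam when the pair is Shyam and Mohan) and the function names are introduced here only for illustration.

```python
from fractions import Fraction


def posterior_pairs(q):
    """Posterior over the pairs RS, RM, SM given that the warden says 'Shyam'.

    q = P(D | SM): probability that the warden names Shyam when the pair
    to be hanged is Shyam and Mohan.
    """
    prior = Fraction(1, 3)
    likelihood = {"RS": Fraction(1), "RM": Fraction(0), "SM": Fraction(q)}
    evidence = sum(likelihood[pair] * prior for pair in likelihood)
    return {pair: likelihood[pair] * prior / evidence for pair in likelihood}


def prob_ram_hanged(q):
    """P(Ram is hanged | D) = P(RS | D) + P(RM | D)."""
    post = posterior_pairs(q)
    return post["RS"] + post["RM"]


results = {q: prob_ram_hanged(q) for q in (Fraction(0), Fraction(1, 2), Fraction(1))}
```

The three values of q reproduce the cases discussed above: probability 1 for q = 0, 2/3 for q = 1/2, and 1/2 for q = 1.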


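Example 3.9. Suppose that an urn was filled with ten balls in the following manner. A fair coin was 
tossed 10 times and, according as it showed heads or tails, one white or one black ball was put into 
the urn. Balls are then drawn with replacement from this urn one at a time, m times in succession, and 
every one turns out to be white. What is the probability that the urn contains nothing but white balls? 

Solution. Let $A_i$ denote the event that the urn contains i white balls (i = 0, 1, ..., 10) and let B 
denote the event that in m independent trials, with a definite but unknown probability $\theta$ of drawing 
a white ball, only white balls appear. Then, 

\[ P(A_i) = \binom{10}{i} \left(\frac{1}{2}\right)^{10} \quad \text{and} \quad P(B \mid A_i) = \left(\frac{i}{10}\right)^m. \]

Then the probability 

\[ P(A_{10} \mid B) = \frac{P(A_{10}) P(B \mid A_{10})}{\sum_{i=1}^{10} P(A_i) P(B \mid A_i)} = \frac{(1/2)^{10}}{\sum_{i=1}^{10} \binom{10}{i} (1/2)^{10} (i/10)^m} = \frac{1}{\sum_{i=1}^{10} \binom{10}{i} (i/10)^m}, \]

where the term i = 0 is omitted from the sums because $P(B \mid A_0) = 0$. Since 

\[ \sum_{i=1}^{10} \binom{10}{i} \left(\frac{i}{10}\right)^m = \sum_{i=1}^{10} \binom{10}{i} \left(1 - \frac{10-i}{10}\right)^m < \sum_{i=0}^{10} \binom{10}{i} e^{-m(10-i)/10} = \left(1 + e^{-m/10}\right)^{10}, \]

we have 

\[ P(A_{10} \mid B) > \left(1 + e^{-m/10}\right)^{-10}, \]

which tends to 1 as $m \to \infty$. 
In particular, if m = 10, 
\[ P(A_{10} \mid B) > (1 + e^{-1})^{-10} = 0.0436 \]
and for m = 100, 
\[ P(A_{10} \mid B) > (1 + e^{-10})^{-10} = 0.9995. \]
However, the a priori probability of having only white balls in the urn is $1/2^{10} \approx 0.00098$, which is much 
less than the lower bound of the posterior probability $P(A_{10} \mid B)$. 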
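
The exact posterior probability in Example 3.9 and its lower bound can be compared numerically. This is a minimal sketch; the function names and the default of ten balls are illustrative assumptions.

```python
from math import comb, exp


def posterior_all_white(m, n_balls=10):
    """Exact P(A_n | B): the urn holds only white balls, given m white draws."""
    denom = sum(comb(n_balls, i) * (i / n_balls) ** m for i in range(1, n_balls + 1))
    return 1.0 / denom


def lower_bound(m, n_balls=10):
    """Lower bound (1 + exp(-m/n))^(-n) on the posterior probability."""
    return (1.0 + exp(-m / n_balls)) ** (-n_balls)


# Exact value and bound for the two cases discussed in the example.
comparison = {m: (posterior_all_white(m), lower_bound(m)) for m in (10, 100)}
```

The exact posterior always lies above the bound, and both approach 1 as the number of draws m grows.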


3.5 BAYES THEOREM FOR FUTURE EVENTS 


Let C be an event which occurs after event B and which itself follows one and only one of the 
events $A_1, A_2, \ldots, A_n$. We are interested in calculating the probability of C when it is known that B has 
happened. From the Bayes theorem 

\[ P(C \mid B) = \frac{P(B \cap C)}{P(B)} = \frac{\sum_{k=1}^{n} P(B \cap C \mid A_k) P(A_k)}{\sum_{k=1}^{n} P(B \mid A_k) P(A_k)} = \frac{\sum_{k=1}^{n} P(C \mid A_k \cap B) P(B \mid A_k) P(A_k)}{\sum_{k=1}^{n} P(B \mid A_k) P(A_k)}. \tag{3.7} \]


Example 3.10. A bag contains 10 balls, either black or white, but it is not known how many of each. 
A ball is drawn at random and is white. What is the probability that if a second ball is drawn at random 
without replacement it will also be white? 

Solution. Let $A_k$ be the event that there are k white balls in the bag, let B be the event that a white ball is 
drawn in the first draw, and let C be the event that a white ball is drawn in the second draw. Then 

\[ P(B \mid A_k) = \frac{k}{10}, \ k = 0, 1, 2, \ldots, 10; \qquad P(C \mid A_k, B) = \frac{k-1}{9}, \ k = 1, 2, \ldots, 10. \]

The range of k in the second expression does not include zero because one white ball has already been observed. Hence, from (3.7), 

\[ P(C \mid B) = \frac{\sum_{k=1}^{10} [k(k-1)/90]\, p_k}{\sum_{k=1}^{10} (k/10)\, p_k}, \quad \text{where } P(A_k) = p_k. \]

In particular, if all compositions of the bag are equally likely, i.e., $p_k = 1/11$, k = 0, 1, ..., 10, we have 
\[ P(C \mid B) = \frac{2}{3}. \]
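
Formula (3.7) for this example is easy to evaluate for any prior over the composition of the bag. The following sketch uses exact fractions; `prob_second_white` is a hypothetical helper name, and the prior is passed as a list indexed by the number of white balls.

```python
from fractions import Fraction


def prob_second_white(prior):
    """P(C | B) from (3.7), where prior[k] = P(A_k) for k = 0, ..., N white balls."""
    n = len(prior) - 1                       # total number of balls in the bag
    numerator = sum(
        Fraction(k - 1, n - 1) * Fraction(k, n) * prior[k] for k in range(1, n + 1)
    )
    denominator = sum(Fraction(k, n) * prior[k] for k in range(0, n + 1))
    return numerator / denominator


uniform_prior = [Fraction(1, 11)] * 11       # all compositions equally likely
answer = prob_second_white(uniform_prior)
```

Under the uniform prior the numerator is (1/11)(330/90) and the denominator is (1/11)(55/10), giving 2/3 as in the text.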
Example 3.11. Suppose an urn contains n coins with probability (k/n) of falling heads for the k-th coin. 
A coin is drawn at random from the urn and tossed t times. What is the probability that the last s tosses 
of the coin result in heads, given that the first (t − s) tosses resulted in heads? 

Solution. Define the events 
A = first (t − s) tosses result in heads, 
B = last s tosses result in heads, 
$C_k$ = k-th coin is drawn. 

Since $P(C_k) = 1/n$, $P(A \cap B \mid C_k) = (k/n)^t$ and $P(A \mid C_k) = (k/n)^{t-s}$, we have 

\[ P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{\sum_{k=1}^{n} P(A \cap B \mid C_k) P(C_k)}{\sum_{k=1}^{n} P(A \mid C_k) P(C_k)} = \frac{\sum_{k=1}^{n} (k/n)^t}{\sum_{k=1}^{n} (k/n)^{t-s}}. \]


3.6 BAYES THEOREM FOR HYPOTHESES 
The Bayes theorem is also referred to in the literature as a formula for probabilities of hypotheses. 
In fact, the fundamental purpose of a statistical analysis is one of inversion, because it aims at 
retrieving the causes (parameters) from the effects (observations). 

The totality of evidence is made up of two components, namely “data” and “prior information”. 
The prior information may come from the data obtained from a past experiment or from the theoretical 
background of the decision maker about the data generating mechanism. It is important to consider the 
order in which they are to be incorporated in our calculations to reach some inference or decision. The 
“prior” does not necessarily mean earlier in time. In fact, any information other than the current data 
may be considered as prior information. The posterior information will, therefore, mean information 
which is updated in the light of information obtained from the current data. Thus the distinction between 
prior probability and posterior probability is only conventional. 


Let us denote by I, D and H the prior information, the data, and some hypothesis, respectively. 
Then P(A | I), which is conditional on the prior information I alone, is the prior probability of the event A. 

The product rule 
\[ P(D, H \mid I) = P(D \mid H, I)\, P(H \mid I) = P(H \mid D, I)\, P(D \mid I) \tag{3.8} \]
gives 
\[ P(H \mid D, I) = \frac{P(H \mid I)\, P(D \mid H, I)}{P(D \mid I)}. \tag{3.9} \]

The Bayes theorem provides the conditional probability of the hypothesis H in the light of the current 
data D and the prior information I. This calculation requires the sampling probability of D, given I and 
H, and also the prior probabilities P(D | I) and P(H | I). 

We know how to determine numerical values of sampling probabilities P(D | H, I) but we do not 
know how to determine prior probabilities. For example, hypothesis H could be that the mean of the 
normal distribution is zero and the variance unity, that is, the data generating process is completely 
specified, provided the prior information I was specified equally well. This is difficult since the problems 
are of a different nature. 

The posterior probability P(H | D, I) means logically later in the particular chain of inference being 
made. The term P(D | H, I) is called the sampling distribution when H is fixed. If we consider a fixed data 
set in the light of different hypotheses $H_1, H_2, \ldots$, then, in its dependence on H for fixed D, P(D | H, I) is called the 
“likelihood”. Note that a likelihood of H is not itself a probability for H; it is a dimensionless numerical 
function which, when multiplied by a prior probability and a normalizing factor, may become a probability. 

Equation (3.9) is a fundamental principle underlying a wide class of scientific inferences in which 
we try to draw conclusions from the data. 
Remark 3.8. According to E.T. Jaynes (2003, page 112), Bayes theorem was never written by Rev. 
Thomas Bayes. The product rule of probability theory was recognized by James Bernoulli (1713) and 
A. de Moivre (1718) long before the work of Bayes. It was not Bayes but Laplace (1774) who first saw 
the result in generality and showed how to use it in real problems of statistical inference. 


3.7 BAYES THEOREM FOR RANDOM VARIABLES 


Probabilistic modeling is meaningful if it can provide an adequate representation of the observed 
phenomenon. However, even when modeling is appropriate, it is often difficult to know exactly the 
probability distribution underlying the generation of the observations. 

The parametric approach represents the distribution of the observations through a density (or 
mass) function $f(x \mid \theta)$, where only the parameter $\theta$ (of finite dimension) is unknown. This takes into 
account the fact that a finite number of observations can efficiently estimate only a finite number of 
parameters. The purpose of statistical analysis is to retrieve the cause(s) (parameter(s) of the model) 
from the effects (observations). It is interesting to note this inverting aspect of statistical inference in 
the notion of the likelihood function. The likelihood function is nothing but the sample density rewritten 
as a function of $\theta$, when $\theta$ is unknown, depending on the observed value x. Thus $\ell(\theta \mid x) = f(x \mid \theta)$. 

In fact, Bayes and Laplace represented uncertainty about the parameter $\theta$ of a model by a probability 
distribution on the parameter space $\Theta$, called the prior distribution. It was a daring step to put causes 
(parameters) and effects (observations) on the same conceptual level, since both of them can have 
probability distributions. 

Thus there is little difference between observations and parameters for any statistical 
manipulation. The inference about the unknown parameter $\theta$ can now be drawn from the conditional 
distribution of $\theta$ given x. 
In a general problem in which we have data x and require inference about a parameter $\theta$, the 
product rule gives 
\[ dF(x)\, dF(\theta \mid x) = dF(x, \theta) = dF(\theta)\, dF(x \mid \theta). \tag{3.10} \]
Thus 
\[ dF(\theta \mid x) = dF(\theta)\, dF(x \mid \theta) / dF(x). \tag{3.11} \]
If every distribution function possesses a corresponding probability density function, then 
\[ f(\theta \mid x) = f(\theta)\, f(x \mid \theta) / f(x), \tag{3.12} \]
where 
\[ f(x) = \int_{\Theta} f(x \mid \theta)\, f(\theta)\, d\theta \tag{3.13} \]
is the marginal density function of x. 

Quite often both the parameter $\theta$ and the data x are continuous and all the frequency functions 
in (3.12) are continuous. There are cases where the data are discrete, so that both $f(x \mid \theta)$ and f(x) would 
be probability mass functions. Exceptionally, the parameter $\theta$ can be discrete. Then both the prior and 
posterior are discrete probability mass functions. In order to cover both cases, it is better to rewrite 
the denominator in the more general form, that is, 
\[ f(x) = \int_{\Theta} f(x \mid \theta)\, dF(\theta). \tag{3.14} \]
Bayes theorem, which passes from one conditional density and a marginal density to the other conditional 
density, is fundamental to Bayesian parametric inference. 


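Case (i) X and $\theta$ are both discrete random variables 

Suppose the random variable X can take on values $x_1, x_2, \ldots$ with probabilities dependent on the 
parameter $\theta$ having parameter space $\Theta = \{\theta_1, \theta_2, \ldots\}$. Then 
\[ P[X = x_k \mid \Theta = \theta_i] = p_k(\theta_i); \quad k = 1, 2, \ldots;\ i = 1, 2, \ldots \]
From Bayes theorem 
\[ P[\Theta = \theta_i \mid X = x_k] = \frac{P[\Theta = \theta_i]\, P[X = x_k \mid \Theta = \theta_i]}{\sum_{j \ge 1} P[\Theta = \theta_j]\, P[X = x_k \mid \Theta = \theta_j]}. \tag{3.15} \]
The conditional distribution of $\theta$ under the condition that $X = x_k$ is the posterior distribution of $\theta$, and 
the marginal distribution of $\theta$ specified by $P(\Theta = \theta_i)$ for i = 1, 2, ... is called the prior distribution of $\theta$. 

Case (ii) The parameter $\theta$ is discrete and the random variable X is continuous 

As before, let $\theta_1, \theta_2, \ldots$ be the values of the parameter $\theta$. If the random variable X is 
continuous and has the conditional density $f(x \mid \theta_i)$, then 
\[ P[\Theta = \theta_i \mid x < X < x + h] = \frac{P[\Theta = \theta_i]\, P[x < X < x + h \mid \Theta = \theta_i]}{\sum_{j \ge 1} P[\Theta = \theta_j]\, P[x < X < x + h \mid \Theta = \theta_j]}; \quad h > 0,\ i = 1, 2, \ldots \]
Let us divide the numerator and the denominator on the right hand side by h and take the limit as $h \to 0$. 
We have 
\[ P(\Theta = \theta_i \mid X = x) = \frac{P[\Theta = \theta_i]\, f(x \mid \theta_i)}{\sum_{j \ge 1} P[\Theta = \theta_j]\, f(x \mid \theta_j)}, \quad i = 1, 2, \ldots \tag{3.16} \]

Case (iii) X is discrete and $\theta$ is continuous 

Suppose the rv X takes on values $x_1, x_2, \ldots$ and has a probability mass function dependent on 
the parameter $\theta$, which itself is a random variable (denoted by $\Theta$) of the continuous type having pdf 
$g(\theta)$. Let us assume that the limit of $P(X = x_k \mid \theta < \Theta < \theta + h)$ as $h \to 0$ exists and is given by 
$P(X = x_k \mid \Theta = \theta)$. If $g(\theta \mid x_k)$ is the conditional density of the rv $\Theta$, given $X = x_k$, then 
\[ g(\theta \mid x_k) = \lim_{h \to 0} \frac{P[\theta < \Theta < \theta + h \mid X = x_k]}{h} \quad \text{and} \quad g(\theta) = \lim_{h \to 0} \frac{P[\theta < \Theta < \theta + h]}{h}. \]
On using Bayes theorem for the events, we have 
\[ P[\theta < \Theta < \theta + h \mid X = x_k] = \frac{P[\theta < \Theta < \theta + h]\, P[X = x_k \mid \theta < \Theta < \theta + h]}{P[X = x_k]}. \]
On dividing by h on both sides and taking the limit as $h \to 0$, we have 
\[ g(\theta \mid x_k) = \frac{g(\theta)\, P[X = x_k \mid \Theta = \theta]}{P[X = x_k]}. \tag{3.17} \]
Further, since 
\[ P[X = x_k] = \int_{\Theta} g(\theta)\, P[X = x_k \mid \Theta = \theta]\, d\theta, \tag{3.18} \]
we may write 
\[ g(\theta \mid x_k) = \frac{g(\theta)\, P[X = x_k \mid \Theta = \theta]}{\int_{\Theta} g(\theta)\, P[X = x_k \mid \Theta = \theta]\, d\theta}. \tag{3.19} \]

Case (iv) X and $\theta$ both continuous 

Let $f(x, \theta)$, $g(\theta \mid x)$, and $f(x \mid \theta)$ be the joint density of x and $\theta$, the conditional density of $\theta$, given 
x, and the conditional density of x, given $\theta$, respectively. 
Under the assumption that the marginal densities m(x) and $g(\theta)$ of x and $\theta$, respectively, satisfy 
the conditions required for the existence of conditional densities, we have 
\[ g(\theta \mid x) = \frac{f(x, \theta)}{m(x)} \quad \text{and} \quad f(x \mid \theta) = \frac{f(x, \theta)}{g(\theta)}. \]
Thus, 
\[ g(\theta \mid x) = \frac{g(\theta)\, f(x \mid \theta)}{m(x)} = \frac{g(\theta)\, f(x \mid \theta)}{\int_{\Theta} g(\theta)\, f(x \mid \theta)\, d\theta}, \tag{3.20} \]
since 
\[ m(x) = \int_{\Theta} f(x, \theta)\, d\theta = \int_{\Theta} f(x \mid \theta)\, g(\theta)\, d\theta. \]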


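Example 3.12. Suppose the rv X takes two values 0 and 1, having the pmf 
\[ f(x \mid \theta) = \begin{cases} 1 - \theta & \text{if } x = 0 \\ \theta & \text{if } x = 1 \end{cases} \]
and the parameter $\theta$ is a discrete random variable having pmf 
\[ g(\theta) = \begin{cases} 0.25 & \text{if } \theta = 0.4 \text{ or } 0.6 \\ 0.50 & \text{if } \theta = 0.5; \end{cases} \]
then the posterior distribution for $\theta$ is discrete: 
\[ g(\theta = \theta_i \mid X = x_j) = \frac{g(\theta = \theta_i)\, f(X = x_j \mid \theta = \theta_i)}{\sum_{i=1}^{3} g(\theta = \theta_i)\, f(X = x_j \mid \theta = \theta_i)}, \quad i = 1, 2, 3;\ j = 0, 1. \]
Thus 
\[ g(\theta \mid x = 0) = \begin{cases} 0.3 & \text{if } \theta = 0.4 \\ 0.5 & \text{if } \theta = 0.5 \\ 0.2 & \text{if } \theta = 0.6 \end{cases} \]
and 
\[ g(\theta \mid x = 1) = \begin{cases} 0.2 & \text{if } \theta = 0.4 \\ 0.5 & \text{if } \theta = 0.5 \\ 0.3 & \text{if } \theta = 0.6. \end{cases} \]

However, if we assume our prior to be of continuous type such that $g(\theta) = 1$, $0 < \theta < 1$, then 
the joint density function for x and $\theta$ is 
\[ f(x, \theta) = \begin{cases} 1 - \theta & \text{if } x = 0;\ 0 < \theta < 1 \\ \theta & \text{if } x = 1;\ 0 < \theta < 1 \end{cases} \]
and the marginal probability mass function of X is 
\[ m(x) = \int_0^1 f(x, \theta)\, d\theta = \begin{cases} \int_0^1 (1 - \theta)\, d\theta = \frac{1}{2} & \text{if } x = 0 \\ \int_0^1 \theta\, d\theta = \frac{1}{2} & \text{if } x = 1, \end{cases} \]
and the posterior pdf for $\theta$ is 
\[ g(\theta \mid x) = \frac{g(\theta)\, f(x \mid \theta)}{m(x)} = \begin{cases} 2(1 - \theta) & \text{if } x = 0;\ 0 < \theta < 1 \\ 2\theta & \text{if } x = 1;\ 0 < \theta < 1. \end{cases} \]

In case we are sure that $\theta$ lies in the interval (0.4, 0.6) but we do not know the actual value 
of $\theta$, then $g(\theta)$ may be taken as a vague prior 
\[ g(\theta) = \begin{cases} \frac{1}{0.2} & \text{if } 0.4 < \theta < 0.6 \\ 0 & \text{otherwise.} \end{cases} \]
Then the marginal pmf for x is 
\[ m(x) = \begin{cases} \int_{0.4}^{0.6} 5(1 - \theta)\, d\theta = \frac{1}{2} & \text{if } x = 0 \\ \int_{0.4}^{0.6} 5\theta\, d\theta = \frac{1}{2} & \text{if } x = 1. \end{cases} \]
Therefore, the posterior pdf for $\theta$ is 
\[ g(\theta \mid x) = \begin{cases} 10(1 - \theta) & \text{if } x = 0;\ 0.4 < \theta < 0.6 \\ 10\theta & \text{if } x = 1;\ 0.4 < \theta < 0.6. \end{cases} \]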
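
The discrete part of Example 3.12 can be reproduced with a few lines of exact arithmetic. This is a sketch only; the dictionary layout and the function names are illustrative choices.

```python
from fractions import Fraction

# Discrete prior of Example 3.12: g(0.4) = g(0.6) = 0.25 and g(0.5) = 0.50.
prior = {
    Fraction(2, 5): Fraction(1, 4),
    Fraction(1, 2): Fraction(1, 2),
    Fraction(3, 5): Fraction(1, 4),
}


def likelihood(x, theta):
    """Bernoulli pmf f(x | theta) for x in {0, 1}."""
    return theta if x == 1 else 1 - theta


def posterior(x):
    """Posterior pmf g(theta | x) obtained from Bayes theorem for a discrete prior."""
    weights = {theta: g * likelihood(x, theta) for theta, g in prior.items()}
    total = sum(weights.values())            # marginal pmf m(x)
    return {theta: w / total for theta, w in weights.items()}


post_x0 = posterior(0)
post_x1 = posterior(1)
```

Here `post_x0` gives the masses 3/10, 1/2, 1/5 at 0.4, 0.5, 0.6, and `post_x1` gives the mirror image, in agreement with the values above.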


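Example 3.13. (Uspensky, 1937) Suppose a coin is tossed n times and r heads are observed. The 
probability $\theta$ of the coin falling heads in any of the n trials is unknown but it is known that it takes 
one of the values $\theta_1, \theta_2, \ldots, \theta_k$ with probability $P(\theta = \theta_i) = p_i$, i = 1, 2, ..., k. Find the probability that 
$\theta$ lies in the interval $[\alpha, \beta]$, where $0 < \alpha < \beta < 1$, given that r heads were obtained in the n tosses of 
the coin. 
Solution. Define the events 

A: r heads in n trials, 

$B_i$: $\theta = \theta_i$; i = 1, 2, ..., k, 

C: $\{\theta : \theta \in [\alpha, \beta]\}$; $\alpha, \beta \in (\theta_1, \theta_k)$. 

Using the Bayes theorem, 

\[ P(B_i \mid A) = \frac{P(A \mid B_i) P(B_i)}{P(A)} = \frac{\binom{n}{r} \theta_i^r (1 - \theta_i)^{n-r} p_i}{\sum_{i=1}^{k} \binom{n}{r} \theta_i^r (1 - \theta_i)^{n-r} p_i} = \frac{\theta_i^r (1 - \theta_i)^{n-r} p_i}{\sum_{i=1}^{k} \theta_i^r (1 - \theta_i)^{n-r} p_i}. \]

Therefore, 
\[ P(C \mid A) = P[\alpha \le \theta \le \beta \mid r \text{ heads in } n \text{ trials}] = \frac{\sum_{\theta_i \in [\alpha, \beta]} \theta_i^r (1 - \theta_i)^{n-r} p_i}{\sum_{i=1}^{k} \theta_i^r (1 - \theta_i)^{n-r} p_i}, \tag{3.21} \]
where the summation in the numerator refers to all values of $\theta_i \in [\alpha, \beta]$. 

Remark 3.9. This example is a discrete version of Rev. Thomas Bayes’ problem in his famous ‘Essay’, 
published posthumously in 1764 by his friend Rev. Richard Price. 

Remark 3.10. If the parameter space $\Theta$ of the probability of success $\theta$ is [0, 1] and the prior distribution 
of $\theta$ is $g(\theta)$, then 

\[ P[\alpha < \theta < \beta \mid r \text{ heads in } n \text{ trials}] = \frac{\int_{\alpha}^{\beta} \theta^r (1 - \theta)^{n-r} g(\theta)\, d\theta}{\int_0^1 \theta^r (1 - \theta)^{n-r} g(\theta)\, d\theta}. \]

Bayes in his ‘Essay’ considered $g(\theta) = 1$, $\theta \in [0, 1]$, and obtained 
\[ P[\alpha < \theta < \beta \mid r \text{ heads in } n \text{ trials}] = \frac{\int_{\alpha}^{\beta} \theta^r (1 - \theta)^{n-r}\, d\theta}{\int_0^1 \theta^r (1 - \theta)^{n-r}\, d\theta}. \tag{3.22} \]

In particular, if we take $p_i = 1/k$ and $\theta_i = i/k$, i = 1, 2, ..., k, in (3.21), and let $k \to \infty$, then 

\[ \frac{1}{k} \sum_{i=1}^{k} \left(\frac{i}{k}\right)^r \left(1 - \frac{i}{k}\right)^{n-r} \to \int_0^1 \theta^r (1 - \theta)^{n-r}\, d\theta \]
and 
\[ \frac{1}{k} \sum_{\alpha \le i/k \le \beta} \left(\frac{i}{k}\right)^r \left(1 - \frac{i}{k}\right)^{n-r} \to \int_{\alpha}^{\beta} \theta^r (1 - \theta)^{n-r}\, d\theta; \]

therefore, in the limit, $P[\alpha < \theta < \beta \mid r \text{ heads in } n \text{ trials}]$ reduces to (3.22). 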


Example 3.14. Suppose I have a coin about which I am relatively convinced that it is a fair coin. Let 
us consider the prior distribution of the probability $\theta$ of heads on the coin to be the Beta(m, m) 
distribution. We know that Beta(m, m) is symmetric about $\theta = 1/2$. For large values of m, it will have a 
small variance and, therefore, it will be highly concentrated around $\theta = 1/2$. 
If the coin is tossed 100 times and I get heads k = 48 times, the posterior pdf of $\theta$ will be 

\[ g(\theta \mid k = 48) \propto g(\theta)\, f(k \mid \theta) \propto \theta^{m+k-1} (1 - \theta)^{100-k+m-1}, \]

which is a kernel of the Beta(m + k, 100 − k + m) density. If we estimate the probability $\theta$ of heads by the 
posterior mean, we have, for m = 10, 
\[ E(\theta \mid k = 48) = \frac{m + k}{100 + 2m} = \frac{58}{120} = 0.48. \]
However, if we were not quite sure whether it was a fair coin, we may take a uniform prior for $\theta$, that 
is, m = 1. Then 
\[ E(\theta \mid k = 48) = \frac{49}{102} = 0.48. \]

Thus it appears that it makes little difference whether we believe the coin to be fair or not. In fact, here the 
prior information is insignificant in comparison with the sample information. The sample information 
dominates the prior information. 

However, if we think that the coin is a loaded one with a high probability of getting tails, we 
may represent this prior information by taking Beta(2, 99) as the prior distribution. Then 
\[ E(\theta \mid k = 48) = \frac{50}{201} = 0.249. \]

In general, if the prior distribution of $\theta$ is Beta($\alpha$, $\beta$), and the coin is tossed n times such that n is much 
larger than ($\alpha + \beta$), then 
\[ E(\theta \mid k) = \frac{\alpha + k}{\alpha + \beta + n} = \frac{\frac{\alpha}{n} + \frac{k}{n}}{\frac{\alpha + \beta}{n} + 1} \approx \frac{k}{n}. \]

Note that this is what we get from the classical definition of probability as relative 
frequency. The Bayesian answer converges to the frequentist answer when the sample information 
swamps the prior information. 
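
The posterior means quoted in Example 3.14 follow from the Beta posterior mean formula and can be reproduced exactly. The sketch below is illustrative; `beta_posterior_mean` is a hypothetical helper name.

```python
from fractions import Fraction


def beta_posterior_mean(a, b, k, n):
    """Posterior mean of theta for a Beta(a, b) prior after k heads in n tosses."""
    return Fraction(a + k, a + b + n)


fair_belief = beta_posterior_mean(10, 10, 48, 100)    # Beta(10, 10) prior
uniform = beta_posterior_mean(1, 1, 48, 100)          # Beta(1, 1) prior
loaded = beta_posterior_mean(2, 99, 48, 100)          # Beta(2, 99) prior

# With many tosses the prior is swamped and the mean approaches k/n.
large_sample = beta_posterior_mean(2, 99, 48000, 100000)
```

The first two means both round to 0.48, while the strongly informative Beta(2, 99) prior pulls the estimate down to about 0.249; with n = 100000 the same prior has almost no effect.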
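Remark 3.11. The posterior distribution incorporates the information from the data as well as the prior 
information. We might expect that it will be less variable than the prior distribution. Iterated 
expectations give $E(\theta) = E(E(\theta \mid x))$. 

Thus the prior mean of $\theta$ is the average of all possible posterior means, each being a function of the data 
x, taken over the marginal distribution m(x) of possible data. 

Example 3.15. If $f(x \mid \theta)$ is $N(\theta, \sigma^2)$, $\sigma^2$ known, and the prior pdf of $\theta$ is $N(\mu, \sigma_0^2)$, then $g(\theta \mid x)$ is normal 
with mean $\left(\frac{\mu}{\sigma_0^2} + \frac{x}{\sigma^2}\right)\left(\frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}\right)^{-1}$ and variance $\left(\frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}\right)^{-1}$. The marginal pdf of X is also normal, 
with mean $\mu$ and variance $\sigma^2 + \sigma_0^2$. Hence 

\[ E\,E(\theta \mid x) = E\left[\left(\frac{\mu}{\sigma_0^2} + \frac{x}{\sigma^2}\right)\left(\frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}\right)^{-1}\right] = \left(\frac{\mu}{\sigma_0^2} + \frac{\mu}{\sigma^2}\right)\left(\frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}\right)^{-1} = \mu = E(\theta) \]
and 
\[ \frac{Var(\theta)}{Var(\theta \mid x)} = \sigma_0^2 \left(\frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}\right) > 1. \]

Furthermore, the posterior variance is, on average, smaller than the prior variance by an amount 
that depends on the variation in posterior means over the distribution of data. In mathematical symbols, 

\[ E(Var(\theta \mid x)) = Var(\theta) - Var(E(\theta \mid x)). \]

The greater the latter variation, the greater the chance of reducing our uncertainty with regard to $\theta$. 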


3.8 SEQUENTIAL USE OF THE BAYES THEOREM 


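Bayes theorem may also be used sequentially. Suppose we have two independently collected 
samples $x_1$ and $x_2$. Then, 
\[ g(\theta \mid x_1, x_2) \propto f(x_1, x_2 \mid \theta)\, g(\theta) = f(x_1 \mid \theta)\, f(x_2 \mid \theta)\, g(\theta) \propto f(x_2 \mid \theta)\, g(\theta \mid x_1), \]
that is, we can obtain the posterior distribution for the combined sample $(x_1, x_2)$ by first finding $g(\theta \mid x_1)$ 
and then treating it as the prior for the second sample $x_2$. This simple algorithm for updating the 
posterior is quite natural when the data arrive sequentially over time. 
Example 3.16. Suppose $x = (x_1, x_2)$ is a random sample from Bin(n, $\theta$) with n known. Let the prior 
distribution of $\theta$ be Beta($\alpha$, $\beta$). The posterior distribution of $\theta$ given $x_1$, $g(\theta \mid x_1)$, is Beta($\alpha + x_1$, $\beta + n - x_1$). 
After the next observation $x_2$ the posterior distribution becomes 

\[ g(\theta \mid x_1, x_2) \propto f(x_2 \mid \theta)\, g(\theta \mid x_1) \propto \left(\theta^{x_2} (1 - \theta)^{n - x_2}\right) \left(\theta^{\alpha + x_1 - 1} (1 - \theta)^{\beta + n - x_1 - 1}\right) = \theta^{\alpha + x_1 + x_2 - 1} (1 - \theta)^{2n + \beta - x_1 - x_2 - 1}, \]
which is the kernel of the Beta($\alpha + x_1 + x_2$, $2n + \beta - x_1 - x_2$) density. 

However, for the full sample $x = (x_1, x_2)$, the posterior distribution is obtained by 

\[ g(\theta \mid x_1, x_2) \propto f(x_1, x_2 \mid \theta)\, g(\theta) = f(x_1 \mid \theta)\, f(x_2 \mid \theta)\, g(\theta) \propto \left(\theta^{x_1} (1 - \theta)^{n - x_1}\right) \left(\theta^{x_2} (1 - \theta)^{n - x_2}\right) \left(\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}\right) = \theta^{\alpha + x_1 + x_2 - 1} (1 - \theta)^{2n + \beta - x_1 - x_2 - 1}, \]

which is also Beta($\alpha + x_1 + x_2$, $2n + \beta - x_1 - x_2$), as obtained above. 

In general, the posterior does not depend on the order in which iid observations are collected, nor on 
whether the prior is updated one observation at a time or with all observations together. 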
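
The equivalence of sequential and batch updating in Example 3.16 can be illustrated by tracking only the Beta parameters. This is a minimal sketch; the numerical values of the prior parameters and the observations are arbitrary illustrative choices.

```python
def update_beta(a, b, x, n):
    """Beta(a, b) prior plus x successes in n binomial trials -> posterior parameters."""
    return a + x, b + n - x


a0, b0 = 2, 3          # illustrative prior Beta(2, 3)
n = 10                 # known number of trials per observation
x1, x2 = 4, 7          # two illustrative binomial observations

# Update with x1 first, then treat the result as the prior for x2.
sequential = update_beta(*update_beta(a0, b0, x1, n), x2, n)

# Same two observations in the opposite order.
reversed_order = update_beta(*update_beta(a0, b0, x2, n), x1, n)

# Single update with the pooled data: x1 + x2 successes in 2n trials.
batch = update_beta(a0, b0, x1 + x2, 2 * n)
```

All three routes lead to the same Beta(α + x₁ + x₂, 2n + β − x₁ − x₂) posterior, here Beta(13, 12).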
Example 3.17. (O’Hagan and Forster, 2004) A series of n Bernoulli trials is performed with unknown 
probability $\theta$ of success. An extra trial is performed, independent of the earlier n trials but with 
probability $\theta/2$ of success. Suppose that $\theta$ has the uniform prior density $g(\theta) = 1$, $\theta \in [0, 1]$. If $x_1$ denotes 
the number of successes in the first n trials, then 

\[ f(x_1 \mid \theta) = \binom{n}{x_1} \theta^{x_1} (1 - \theta)^{n - x_1}; \quad x_1 = 0, 1, \ldots, n. \]

Suppose the outcome of the extra trial is denoted by $x_2$, having value 1 if the extra trial is a success 
and 0 otherwise. The posterior distribution of $\theta$, given the data $(x_1, x_2)$, is 

\[ g(\theta \mid x_1, x_2) = \frac{f(x_2 \mid \theta)\, g(\theta \mid x_1)}{\int_0^1 f(x_2 \mid \theta)\, g(\theta \mid x_1)\, d\theta} = \begin{cases} \dfrac{(2 - \theta)\, \theta^{x_1} (1 - \theta)^{n - x_1} (n + 2)}{(2n - x_1 + 3)\, B(x_1 + 1, n - x_1 + 1)} & \text{if } x_2 = 0 \\[2ex] \dfrac{\theta^{x_1 + 1} (1 - \theta)^{n - x_1} (n + 2)}{(x_1 + 1)\, B(x_1 + 1, n - x_1 + 1)} & \text{if } x_2 = 1. \end{cases} \]


3.9 TRUNCATED PRIOR DISTRIBUTION 


Suppose the data x are drawn from a population with pdf (or pmf) $f(x \mid \theta)$, where the parameter 
$\theta \in \Theta$. Suppose the prior distribution of $\theta$ over the whole parameter space $\Theta$ is $g(\theta)$. Then, the 
posterior distribution of $\theta$ is given by 

\[ g(\theta \mid x) = \frac{f(x \mid \theta)\, g(\theta)}{\int_{\Theta} f(x \mid \theta)\, g(\theta)\, d\theta}, \quad \theta \in \Theta. \tag{3.23} \]

However, if our prior information about $\theta$ is such that 

\[ g_T(\theta) = \frac{g(\theta)}{\int_T g(\theta)\, d\theta}, \quad \text{if } \theta \in T \subset \Theta, \]

then the posterior distribution over the restricted parameter space T is obtained by 

\[ g_T(\theta \mid x) = \frac{g(\theta \mid x)}{\int_T g(\theta \mid x)\, d\theta}, \quad \text{if } \theta \in T. \]

It is interesting to note that the posterior distribution remains the same even if the process is 
reversed, i.e., we first truncate the prior distribution over the restricted parameter space T and use this 
truncated prior distribution to obtain the truncated posterior distribution over T. 
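
The claim that truncation and updating commute can be checked numerically. The sketch below is only an illustration under simplifying assumptions: the parameter takes values on a small discrete grid, the prior weights and the likelihood (three successes and one failure in Bernoulli trials) are arbitrary choices, and T is taken to be the set of grid points with θ ≥ 0.5.

```python
def normalize(weights):
    """Rescale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


thetas = [0.1, 0.3, 0.5, 0.7, 0.9]                     # illustrative parameter grid
prior = normalize([1.0, 2.0, 3.0, 2.0, 1.0])           # illustrative prior weights
lik = [t ** 3 * (1.0 - t) for t in thetas]             # likelihood of 3 successes, 1 failure
in_T = [t >= 0.5 for t in thetas]                      # restricted parameter space T


def bayes(prior_weights):
    """Posterior proportional to prior times likelihood."""
    return normalize([p * l for p, l in zip(prior_weights, lik)])


def truncate(weights):
    """Restrict a distribution to T and renormalize."""
    return normalize([w if keep else 0.0 for w, keep in zip(weights, in_T)])


update_then_truncate = truncate(bayes(prior))
truncate_then_update = bayes(truncate(prior))
```

Both orders give the same distribution over T, because each is proportional to prior times likelihood on T and zero elsewhere.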

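Example 3.18. Suppose the random variable X ~ N($\theta$, 1) and the prior distribution for $\theta$ is N($\mu$, 1) except 
that we know that $\theta > 0$. The truncated prior distribution of $\theta$ is 

\[ g_T(\theta) = \frac{\frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(\theta - \mu)^2\right]}{\int_0^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(\theta - \mu)^2\right] d\theta} = \frac{\frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(\theta - \mu)^2\right]}{\Phi(\mu)}, \quad \text{if } \theta > 0. \]

Therefore, the posterior density of $\theta$ given x is 

\[ g_T(\theta \mid x) = \frac{g_T(\theta)\, f(x \mid \theta)}{\int_0^{\infty} g_T(\theta)\, f(x \mid \theta)\, d\theta}, \quad \theta \in (0, \infty) \]
\[ = \frac{\left[\frac{1}{\sqrt{2\pi}\, \Phi(\mu)} \exp\left(-\frac{1}{2}(\theta - \mu)^2\right)\right] \left[\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(x - \theta)^2\right)\right]}{\int_0^{\infty} \left[\frac{1}{\sqrt{2\pi}\, \Phi(\mu)} \exp\left(-\frac{1}{2}(\theta - \mu)^2\right)\right] \left[\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(x - \theta)^2\right)\right] d\theta} \]
\[ = \frac{\exp\left[-\frac{1}{2} \cdot 2(\theta - \theta^*)^2\right]}{\int_0^{\infty} \exp\left[-\frac{1}{2} \cdot 2(\theta - \theta^*)^2\right] d\theta} = \frac{\pi^{-1/2} \exp\left[-(\theta - \theta^*)^2\right]}{\Phi(\sqrt{2}\, \theta^*)}, \]

where $\theta^* = \dfrac{\mu + x}{2}$. 

However, if we start with the prior distribution for $\theta$ as N($\mu$, 1) for all $\theta \in (-\infty, \infty)$, then $g(\theta \mid x)$ is 
N(($\mu$ + x)/2, 1/2). Now if we truncate $g(\theta \mid x)$ over the range $(0, \infty)$, the posterior distribution becomes 

\[ \frac{g(\theta \mid x)}{\int_0^{\infty} g(\theta \mid x)\, d\theta} = \frac{\pi^{-1/2} \exp\left[-(\theta - \theta^*)^2\right]}{\Phi(\sqrt{2}\, \theta^*)}, \quad \theta \in (0, \infty), \]

which is the same as before. 

Extension Rule. A second useful rule of probability that easily follows from the basic rules of 
probability is the extension rule. It says 

\[ P(x \mid \theta, I) = \sum_{\phi} P(x \mid \theta, \phi, I)\, P(\phi \mid \theta, I), \tag{3.24} \]

where $\phi$ is a nuisance parameter. 

Its usefulness lies in the fact that the judgements about the data x often involve not only the quantity 
of interest $\theta$ but also nuisance quantities $\phi$. The rule allows these to be eliminated by summation (or 
integration). Bayesian methods include both Bayes theorem and the extension rule. In general, 

\[ g(\theta \mid x) = \int g(\theta, \phi \mid x)\, d\phi = \int g(\theta \mid \phi, x)\, g(\phi \mid x)\, d\phi. \tag{3.25} \]

Thus $g(\theta \mid x)$ is a mixture of the conditional posterior distributions, given the nuisance parameter $\phi$, where 
$g(\phi \mid x)$ is a weighting function for the different values of $\phi$. The weights depend on the marginal 
posterior density of $\phi$ and thus on a combination of the data and the prior distribution of $\phi$. 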


3.10 BAYES THEOREM AS OPTIMAL INFORMATION PROCESSING RULE 


Bayes theorem depends on the product rule 
\[ P(A, B \mid I) = P(A \mid I)\, P(B \mid A, I) \tag{3.26} \]
of probability, where I denotes the initial information. Jeffreys (1967, pages 24-25) assumed that the 
elements of the sets A, B, and $A \cap B$ are equally probable. However, the product rule of probability is 
difficult to prove under general conditions and thus is often introduced as an axiom (see Zellner, 1991, 
pages 25-26). 
Zellner (1988) formulated an information processing approach which is useful in inference 
situations. An acceptable inference procedure should have the property that it neither ignores any of 
the input information nor inputs any false information. We shall see in Chapter 5 how we may use 
information theory to construct non-informative prior distributions by maximizing entropy. 

Let $g(\theta \mid I)$ be a prior density for the parameter $\theta$ and $f(x \mid \theta, I)$ be the density of the data x, based on 
the prior information I. The outputs of an information processing rule (IPR) are $g(\theta \mid x, I)$, the post-data density 
for $\theta$, and the marginal density $m(x \mid I)$ for x. Note that 

\[ m(x \mid I) = \int_{\Theta} f(x \mid \theta, I)\, g(\theta \mid I)\, d\theta \tag{3.27} \]
is an output obtained by mixing prior and data probabilistic information. 
A good and efficient IPR should satisfy the Information Conservation Principle (ICP), namely: 
Input Information = Output Information. 
We shall suppress I appearing in all the expressions. 
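Consider, as post-data information measures, the averages of the logarithms of the pdfs taken with respect 
to the post-data density $g(\theta \mid x)$. Thus the input information in the data density $f(x \mid \theta)$ and in the prior density $g(\theta)$ are 

\[ \int_{\Theta} g(\theta \mid x) \log f(x \mid \theta)\, d\theta \quad \text{and} \quad \int_{\Theta} g(\theta \mid x) \log g(\theta)\, d\theta, \]

respectively. Similarly, the output information consists of the information in $g(\theta \mid x)$, 

\[ \int_{\Theta} g(\theta \mid x) \log g(\theta \mid x)\, d\theta, \]

and the information in the marginal density of the data x, 

\[ \int_{\Theta} g(\theta \mid x) \log m(x)\, d\theta = \log m(x). \]

Zellner (1988) used the ICP to construct the criterion functional to obtain $g(\theta \mid x)$ such that the output 
information is as close as possible to the input information and, ideally, equal to it. 

Mathematically, the problem is to find a proper pdf $g(\theta \mid x)$ which minimizes 

\[ \Delta[g(\theta \mid x)] = \int_{\Theta} g(\theta \mid x) [\log g(\theta \mid x) + \log m(x)]\, d\theta - \int_{\Theta} g(\theta \mid x) [\log f(x \mid \theta) + \log g(\theta)]\, d\theta = \int_{\Theta} g(\theta \mid x) \log \frac{g(\theta \mid x)\, m(x)}{f(x \mid \theta)\, g(\theta)}\, d\theta \]

subject to the constraint $\int_{\Theta} g(\theta \mid x)\, d\theta = 1$. 

Let us use Lagrange’s method of multipliers to solve this minimization problem. The Lagrange 
function is 

\[ L = \Delta[g(\theta \mid x)] + \lambda \left[ \int_a^b g(\theta \mid x)\, d\theta - 1 \right], \]
where we are assuming $\theta \in [a, b]$, a and b finite, and $\lambda$ is the Lagrange multiplier. Note that for the case $\theta \in (-\infty, \infty)$ 
we may let a tend to $-\infty$ and b tend to $\infty$. 
The solution can be easily obtained by using the Euler-Lagrange equation, discussed in Section 5.7, 
of the calculus of variations. Thus, from 

\[ \frac{\partial}{\partial g(\theta \mid x)} \left\{ g(\theta \mid x) \log \frac{g(\theta \mid x)\, m(x)}{f(x \mid \theta)\, g(\theta)} + \lambda\, g(\theta \mid x) \right\} = 0, \]
we get 
\[ g^*(\theta \mid x) = \frac{f(x \mid \theta)\, g(\theta)}{m(x)} \exp(-\lambda - 1). \]
Since $\int g^*(\theta \mid x)\, d\theta = 1$, and using (3.27), we have $\exp(-\lambda - 1) = 1$. 
Hence 
\[ g^*(\theta \mid x) = \frac{f(x \mid \theta)\, g(\theta)}{m(x)}; \quad \theta \in (a, b). \]
Furthermore, 
\[ \frac{\partial^2}{\partial g(\theta \mid x)^2} \left\{ g(\theta \mid x) \log \frac{g(\theta \mid x)\, m(x)}{f(x \mid \theta)\, g(\theta)} + \lambda\, g(\theta \mid x) \right\} = \frac{1}{g(\theta \mid x)} > 0. \tag{3.30} \]

Hence $g^*(\theta \mid x)$ minimizes $\Delta[g(\theta \mid x)]$. The optimum solution $g^*(\theta \mid x)$ is nothing but the posterior 
pdf yielded by the Bayes IPR, that is, Bayes theorem for random variables. Furthermore, 
\[ \Delta[g^*(\theta \mid x)] = 0. \]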
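
The optimality result can be illustrated numerically. The sketch below is an assumption-laden analogue, not Zellner's derivation: it replaces the integral over θ by a sum over a three-point parameter grid (the discrete prior of Example 3.12 with the observation x = 0), and the function name `delta` is hypothetical.

```python
from math import log

thetas = [0.4, 0.5, 0.6]
prior = [0.25, 0.50, 0.25]                    # g(theta)
lik = [1.0 - t for t in thetas]               # f(x = 0 | theta)
m = sum(p * l for p, l in zip(prior, lik))    # marginal m(x), discrete analogue of (3.27)

# Post-data pmf produced by Bayes theorem.
bayes_post = [p * l / m for p, l in zip(prior, lik)]


def delta(g):
    """Discrete analogue of the criterion: sum of g * log(g * m / (f * prior))."""
    return sum(
        gi * log(gi * m / (li * pi))
        for gi, li, pi in zip(g, lik, prior)
        if gi > 0
    )


at_bayes = delta(bayes_post)                  # output information equals input information
at_prior = delta(prior)                       # ignoring the data loses information
at_other = delta([0.1, 0.2, 0.7])             # an arbitrary competing pmf
```

The criterion vanishes at the Bayes posterior and is strictly positive for the other candidate pmfs, in line with the conclusion that Bayes theorem is the information-conserving rule.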


Remark 3.12. It is interesting to note that $\Delta[g(\theta \mid x)]$ may be rewritten as 

\[ \Delta[g(\theta \mid x)] = 2 \int g(\theta \mid x) \log \left[ \frac{g(\theta \mid x)}{g(\theta)} \cdot \frac{m(x)}{f(x \mid \theta)} \right]^{1/2} d\theta = 2 \int g(\theta \mid x) \log G\, d\theta = 2E(\log G), \tag{3.31} \]

where G is the geometric mean of the ratios $\dfrac{g(\theta \mid x)}{g(\theta)}$ and $\dfrac{m(x)}{f(x \mid \theta)}$, and the expectation is taken with respect 
to the posterior pdf $g(\theta \mid x)$. Thus minimizing $\Delta[g(\theta \mid x)]$ involves choosing $g(\theta \mid x)$ such that $E(\log G)$ is 
as small as possible. 

Equation (3.31) may be interpreted as an information-theory divergence measure relating the 
pdfs $g(\theta \mid x)$, m(x), $g(\theta)$ and $f(x \mid \theta)$. Writing $g^*(\theta \mid x) = g(\theta) f(x \mid \theta)/m(x)$, 

\[ \Delta[g(\theta \mid x)] = \int_a^b g(\theta \mid x) \log \frac{g(\theta \mid x)}{g^*(\theta \mid x)}\, d\theta \tag{3.32} \]
is the negative entropy of $g(\theta \mid x)$ relative to the measure $g^*(\theta \mid x)$. Thus $g^*(\theta \mid x)$ is also the maxent 
(maximum entropy) solution. 
Remark 3.13. Bernardo (1988) treats inference about $\theta$ as a decision problem where the action space 
consists of the possible posteriors $g(\theta \mid x)$ and assumes a utility function $u[g(\theta \mid x), \theta]$ that describes the 
utility of the IPR that leads to $g(\theta \mid x)$, and then finds the IPR that maximizes the expected utility. Bernardo 
(1979a) and Savage (1971) show that, for the utility function 

\[ u[g(\theta \mid x), \theta] = A \log g(\theta \mid x) + B(\theta), \]
the optimal IPR is the Bayes theorem. 


Chapter 4 


Conjugate Prior Distribution 


4.1 INTRODUCTION 


Prior information is based on the investigator’s experience, intuition, and theoretical ideas. It may be 
contained in samples of historical data obtained by a reasonable scientific experiment, or come from 
introspection or casual observation. The prior distribution provides a specific, formalised statement of 
currently assumed knowledge in probabilistic terms. A distinctive feature of the Bayesian approach is 
the introduction of a prior density to represent prior information about the possible values of the 
parameters of a model. Its introduction permits the use of Bayes theorem to obtain exact finite-sample 
posterior densities, to draw inferences about the models, and to make decisions when the loss 
functions are available. 

In the Bayesian approach, prior information about the parameter(s) of a model is represented by 
an appropriately chosen probability density (or mass) function. We must be careful in choosing a prior 
pdf to represent prior information. The prior distribution is a way to summarize the available prior 
information. It may also be considered a tool which provides a unified inferential procedure having 
acceptable frequentist properties. A chosen prior distribution need not represent any 
kind of belief of the investigator. Furthermore, the terms prior probability distribution and 
posterior distribution suggest probabilistic initial and final states of information. These terms need not 
be interpreted in a chronological sense. In fact, any additional information other than the 
current data may be defined as prior information. In particular, the prior pdf must be defined over 
the range of the parameter. For example, the probability of success in Bernoulli trials has a range 
(0, 1) and, therefore, we must choose some pdf defined over the range (0, 1). On the other hand, the 
variance of a normal distribution has the range (0, ∞). 

Ideally, the prior distribution should provide a specific, formalised statement of currently assumed 
knowledge in probabilistic terms. As the available prior information is usually not precise enough to determine 
an exact prior distribution, we may have many probability distributions which may represent the 
available information. Some of the reasons for not being able to specify exact prior information are time, 
finances, and patience (willingness) to gather and analyse necessary and relevant information. 
Obviously, there is no unique way of choosing a prior distribution, and the resulting inference or 
decision may be influenced by the chosen prior distribution. The effect may be negligible, moderate, 
or enormous, and there is always a possibility that the final answer has been obtained with a distorted 
prior distribution. 

According to Diaconis and Ylvisaker (1985), there are three distinct Bayesian approaches for
selection of prior distributions. The classical Bayesian approach considers flat priors to represent
objectivity in the analysis. Such priors are generally known as nil, vague, diffuse, reference, or
non-informative priors, and there is no clear-cut policy or method to construct or define a unique
objective prior. The modern Bayesian approach allows the priors to have characteristics like closure
under sampling (conjugacy), suggested by G. Barnard (1954) and later developed by Raiffa and Schlaifer
(1961), and specification of hyperparameter values according to some specific criteria. The third
approach, followed by subjective Bayesians, depends on elicitation of prior distributions based on
pre-existing scientific knowledge in the area of investigation. This information may be available from
previous investigations or from non-statistician experts. In fact, most Bayesians follow a mixed
approach that may combine previous knowledge, mathematical convenience and a desire to be as
objective as possible.


Conjugate priors 


Quite often a prior distribution is chosen which satisfies specified summaries. It is usually
advocated that, in the absence of the correct prior, we may pick the most convenient distribution
to which the summaries may be fitted. For example, if the prior mean and variance of a scalar parameter
$\theta$ are given, then the most convenient choice is the normal distribution or, if $\theta$ is positive,
we could easily fit a gamma distribution having those moments. It will be seen that most of the time such
prior distributions produce posterior distributions of a similar form and, therefore, the analysis becomes
very convenient. However, choosing any other proper prior distribution may not lead to an analytically
tractable posterior distribution. In general, if our prior distribution happens to be such that the posterior
is easy to summarize, irrespective of the actual observed data, then it can be considered a convenient choice.
Analytical tractability means that (i) the posterior distribution is easily determined using the product of
the likelihood function and the prior distribution, so that the normalising constant, which happens to be
the marginal distribution of the data, is not formally required to be evaluated; and (ii) if the choice of
prior is such that prior and posterior belong to the same family of distributions, then posterior summaries,
such as expectations and probabilities, are easy to obtain.
Remark 4.1. The conjugate priors are sometimes called objective because the sampling distribution
$f(x \mid \theta)$ completely determines the class of prior distributions. However, subjective Bayesians
are suspicious of conjugate priors since they are justified on technical grounds and are not obtained by
fitting the available prior information. We may, therefore, consider conjugate priors as a first
approximation to perform default Bayesian analysis. The conjugate priors are often used in limited
information scenarios since they require specification of only a few hyperparameters.


4.2. SUFFICIENCY 


Definition 4.1. Suppose $g$ is any prior pdf (pmf) of the parameter $\theta$ and $x$ is any observation in the
sample space $S$. Let $g(\theta \mid x)$ denote the posterior pdf (pmf) of $\theta$, which exists and is obtained
by using the Bayes theorem. A statistic $T$ is a sufficient statistic for the family of pdfs (pmfs)
$\{f(\cdot \mid \theta),\ \theta \in \Theta\}$ if $g(\theta \mid x_1) = g(\theta \mid x_2)$ for any prior pdf (or pmf) $g$
and any two observations $x_1, x_2 \in S$ such that $T(x_1) = T(x_2)$.

In particular, for any prior distribution $g(\theta)$ for $\theta$, the posterior distribution of $\theta$ for a
sequence $x_1, x_2, \ldots, x_n$ of $n$ independent tosses of the same coin is given by

$$g(\theta \mid x) = \frac{\theta^{\sum x_i}(1-\theta)^{n-\sum x_i}\, g(\theta)}{\int_\Theta \theta^{\sum x_i}(1-\theta)^{n-\sum x_i}\, g(\theta)\, d\theta} = g\Big(\theta \,\Big|\, \sum_{i=1}^{n} x_i\Big).$$

Hence $T(x) = \sum_{i=1}^{n} x_i$ is a sufficient statistic for the family of binomial pmfs, since
$g\big(\theta \mid \sum_{i=1}^{n} x_i\big)$ is the same for any sample of size $n$ as long as $\sum_{i=1}^{n} x_i$
remains unchanged.
Example 4.1. Suppose a random experiment consisting of tossing a coin five times is conducted and
the number of heads is counted. Let us denote the occurrence of a head by 1 and that of a tail by 0. The
experiment is repeated two times and we observe the sequence $s_1 = (1\ 1\ 1\ 0\ 0)$ in the first experiment
and $s_2 = (0\ 1\ 0\ 1\ 1)$ in the second experiment. Since the number of heads is 3 in both the experiments,
the posterior pdf of $\theta$ works out to be the same for the two experiments, given the number of heads
in five trials. Thus $T(x) = \sum_{i=1}^{5} x_i$ is the sufficient statistic for $\theta$.
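The claim in Example 4.1 can be checked numerically. The following is a minimal sketch, not part of the original text: the function names, the grid of $\theta$ values and the particular unnormalised prior are all illustrative assumptions. The likelihood is built toss by toss, so the agreement of the two posteriors is a result of the computation rather than something assumed.

```python
def likelihood(xs, theta):
    """Bernoulli likelihood of theta, multiplied toss by toss (1 = head, 0 = tail)."""
    out = 1.0
    for x in xs:
        out *= theta if x == 1 else (1.0 - theta)
    return out


def grid_posterior(xs, prior, grid):
    """Normalised posterior weights of theta on a grid, for an arbitrary prior."""
    weights = [prior(t) * likelihood(xs, t) for t in grid]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical choices: 200 midpoints in (0, 1) and an arbitrary unnormalised prior.
grid = [(i + 0.5) / 200 for i in range(200)]
prior = lambda t: 1.0 + t * t

s1 = [1, 1, 1, 0, 0]   # first experiment of Example 4.1
s2 = [0, 1, 0, 1, 1]   # second experiment of Example 4.1

post1 = grid_posterior(s1, prior, grid)
post2 = grid_posterior(s2, prior, grid)

# Largest pointwise difference between the two posteriors (rounding error only).
max_diff = max(abs(a - b) for a, b in zip(post1, post2))
print(max_diff)
```

Replacing `prior` by any other positive function on (0, 1) should leave `max_diff` at rounding-error level, which is what sufficiency of the number of heads asserts.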
Example 4.2. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from the $U(0, \theta)$ distribution. Denote
$x_{(n)} = \max(x_1, x_2, \ldots, x_n)$. Since the likelihood function of $\theta$, given $x_1, x_2, \ldots, x_n$, is

$$\ell(\theta \mid x_1, x_2, \ldots, x_n) = \frac{1}{\theta^{n}}\, I_{(x_{(n)},\, \infty)}(\theta),$$

the posterior distribution of $\theta$ is a function of $x_{(n)}$ irrespective of the choice of the prior distribution.
Therefore, if two random samples of size $n$ each are observed from a $U(0, \theta)$ population such that the
maximum of observations from the first sample is equal to that of the second sample, then

$$g\big(\theta \mid x_{(n)} \text{ from the first sample}\big) = g\big(\theta \mid x_{(n)} \text{ from the second sample}\big).$$

Thus, $T(x) = x_{(n)}$ is a sufficient statistic.

A working definition of a sufficient statistic $T$ from the Bayesian perspective is therefore:

A statistic $T$ is a sufficient statistic if, for any prior distribution of $\theta$, the posterior distribution
depends on the observed value of $x$ only through $T(x)$.


Result 4.1. Suppose $x = (x_1, x_2, \ldots, x_n)$ are iid observations from a pdf (pmf) $f(x \mid \theta)$ and suppose
$T(x)$ is a sufficient statistic for this family of distributions. Then

$$g(\theta \mid x) = g(\theta \mid T(x)). \tag{4.1}$$

Proof. We know

$$g(\theta \mid x) = \frac{f(x \mid \theta)\, g(\theta)}{\int_\Theta f(x \mid \theta)\, g(\theta)\, d\theta}.$$

On using the Neyman factorization criterion,

$$f(x \mid \theta) = u(x)\, v(T(x), \theta), \tag{4.2}$$

where $u(x) > 0$ for all $x \in S$ and does not depend on $\theta$, and $v$ is a non-negative function which depends
on $x$ only through $T(x)$, we have

$$g(\theta \mid x) = \frac{v(T(x), \theta)\, g(\theta)}{\int_\Theta v(T(x), \theta)\, g(\theta)\, d\theta} = g(\theta \mid T(x)).$$


Therefore, if a sufficient statistic $T(x)$ exists for the family of distributions, then it is not necessary
to work with the entire data set. We may consider the experiment where only one observation is
obtained from the sampling distribution of $T(X)$. This reduces the dimensionality of the problem.


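Example 4.3. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from the $N(\theta, 1)$ distribution and the prior
for $\theta$ is $N(0, 1)$. The likelihood function for $\theta$ is

$$\ell(\theta \mid x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta)
= \frac{1}{(2\pi)^{n/2}} \exp\Big[-\frac{1}{2}\Big\{\sum_{i=1}^{n}(x_i - \bar{x})^2 + n(\theta - \bar{x})^2\Big\}\Big].$$

Hence, the posterior distribution based on the random sample $x_1, x_2, \ldots, x_n$ is

$$g(\theta \mid x_1, x_2, \ldots, x_n)
= \frac{\exp\big[-\frac{1}{2}\big\{\sum_{i=1}^{n}(x_i - \bar{x})^2 + n(\theta - \bar{x})^2\big\}\big]\exp\big(-\frac{\theta^2}{2}\big)}
{\int_{-\infty}^{\infty} \exp\big[-\frac{1}{2}\big\{\sum_{i=1}^{n}(x_i - \bar{x})^2 + n(\theta - \bar{x})^2\big\}\big]\exp\big(-\frac{\theta^2}{2}\big)\, d\theta}
= \sqrt{\frac{n+1}{2\pi}}\, \exp\Big[-\frac{n+1}{2}(\theta - \hat{\theta})^2\Big],
\quad \text{where } \hat{\theta} = \frac{n\bar{x}}{n+1}.$$

However, if we work out the posterior distribution of $\theta$, given one observation from the sampling
distribution of the sufficient statistic $T(X) = \bar{X}$, the calculations are simplified to a great extent. We
see that

$$\ell(\theta \mid \bar{x}) = f(\bar{x} \mid \theta) = \sqrt{\frac{n}{2\pi}}\, \exp\Big[-\frac{n}{2}(\bar{x} - \theta)^2\Big].$$

Hence

$$g(\theta \mid \bar{x}) = \frac{\exp\big[-\{n(\theta - \bar{x})^2 + \theta^2\}/2\big]}{\int_{-\infty}^{\infty} \exp\big[-\{n(\theta - \bar{x})^2 + \theta^2\}/2\big]\, d\theta}
= \sqrt{\frac{n+1}{2\pi}}\, \exp\Big[-\frac{n+1}{2}(\theta - \hat{\theta})^2\Big].$$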
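As a numerical illustration of Example 4.3, the sketch below compares the closed-form posterior $N(n\bar{x}/(n+1),\ 1/(n+1))$ with a brute-force integration that uses the full sample. The data values, the integration range and the function names are made up for the illustration and are not from the text.

```python
import math


def closed_form_posterior(xs):
    """Posterior mean and variance from Example 4.3, using only n and the sample mean."""
    n = len(xs)
    xbar = sum(xs) / n
    return n * xbar / (n + 1), 1.0 / (n + 1)


def grid_posterior_moments(xs, lo=-10.0, hi=10.0, steps=20001):
    """Posterior mean and variance by summing over a fine grid, using the whole sample."""
    h = (hi - lo) / (steps - 1)
    grid = [lo + i * h for i in range(steps)]
    # Log posterior up to a constant: N(theta, 1) likelihood times N(0, 1) prior.
    logp = [-0.5 * sum((x - t) ** 2 for x in xs) - 0.5 * t * t for t in grid]
    top = max(logp)
    w = [math.exp(v - top) for v in logp]
    z = sum(w)
    mean = sum(t * wi for t, wi in zip(grid, w)) / z
    var = sum((t - mean) ** 2 * wi for t, wi in zip(grid, w)) / z
    return mean, var


xs = [0.3, 1.2, -0.4, 0.9]          # hypothetical observations
exact_mean, exact_var = closed_form_posterior(xs)
grid_mean, grid_var = grid_posterior_moments(xs)
print(exact_mean, grid_mean)
print(exact_var, grid_var)
```

The two routes should agree to several decimal places, which is the practical content of working with $\bar{x}$ alone.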


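Remark 4.2. Sufficiency is an important concept in classical inference. However, it is not strictly
necessary to look for situations where a sufficient statistic exists when we are working in a Bayesian
setup. In fact, the Bayes theorem automatically takes the sufficiency aspect into account.

Result 4.2. A statistic $T$ is sufficient for the family $\{f(\cdot \mid \theta),\ \theta \in \Theta\}$ if, and only if,
$f(x \mid \theta)$ can be factored as follows for all values of $x$ in $S$ and $\theta$ in $\Theta$:

$$f(x \mid \theta) = u(x)\, v(T(x), \theta).$$

Here the function $u$ is positive and does not depend on $\theta$, and the function $v$ is non-negative and
depends on $x$ only through $T(x)$.

Proof. In order to prove the “only if” part, suppose that $T(x)$ is a sufficient statistic for the family
$\{f(\cdot \mid \theta),\ \theta \in \Theta\}$. Suppose $g(\theta)$ is any prior for $\theta$ such that $g(\theta) > 0$ for all
$\theta \in \Theta$ (the case $g(\theta) = 0$ for all $\theta \in \Theta$ is of no interest). Since the posterior distribution is

$$g(\theta \mid x) = \frac{f(x \mid \theta)\, g(\theta)}{\int_\Theta f(x \mid \theta)\, g(\theta)\, d\theta}
\quad \text{for all } \theta \in \Theta,\ x \in S,$$

we have

$$f(x \mid \theta) = \frac{g(\theta \mid x)}{g(\theta)} \int_\Theta f(x \mid \theta)\, g(\theta)\, d\theta.$$

Since $T(x)$ is sufficient for $\theta$, the posterior depends on $x$ only through $T(x)$, that is,
$g(\theta \mid x) = g(\theta \mid T(x))$ as in (4.1), and we have

$$f(x \mid \theta) = \frac{g(\theta \mid T(x))}{g(\theta)}\, u(x) = v(T(x), \theta)\, u(x),$$

where $u(x) = \int_\Theta f(x \mid \theta)\, g(\theta)\, d\theta$ and $v(T(x), \theta) = \dfrac{g(\theta \mid T(x))}{g(\theta)}$.

The “if” part is proved in Result 4.1.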
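Example 4.4. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from $U(\theta_1, \theta_2)$; then

$$f(x \mid \theta_1, \theta_2) = \frac{1}{(\theta_2 - \theta_1)^{n}}\, I(\theta_1 < x_{(1)})\, I(x_{(n)} < \theta_2),$$

where $x_{(1)}$ and $x_{(n)}$ are the smallest and largest order statistics, respectively. Since $f(x \mid \theta_1, \theta_2)$
depends on the data through $x_{(1)}$ and $x_{(n)}$ alone, it follows that $(x_{(1)}, x_{(n)})$ is a jointly sufficient
statistic for $(\theta_1, \theta_2)$. Hence, with $\theta = (\theta_1, \theta_2)$,

$$g(\theta \mid x_1, \ldots, x_n) = g(\theta \mid x_{(1)}, x_{(n)}).$$

Example 4.5. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from a Pareto distribution having the likelihood
function

$$\ell(\theta \mid x) = \frac{\theta^{n}}{\prod_{i=1}^{n} x_i}\, \exp\Big(-\theta \sum_{i=1}^{n} \log\frac{x_i}{k}\Big)
= \frac{\theta^{n}}{\prod_{i=1}^{n} x_i}\, \exp(\theta n \log k)\, \exp\Big(-\theta \sum_{i=1}^{n} \log x_i\Big)
= u(x)\, v\big(T(x), (\theta, k)\big), \quad u(x) > 0.$$

Hence $T(x) = \sum_{i=1}^{n} \log x_i$ is sufficient for the Pareto family. Thus,

$$g(\theta \mid x_1, \ldots, x_n) = g\Big(\theta \,\Big|\, \sum_{i=1}^{n} \log x_i\Big).$$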

4.3. CONSTRUCTION OF CONJUGATE PRIOR 
Closure under Multiplication 


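Definition 4.2. Let $g(\theta \mid \alpha_1)$ and $g(\theta \mid \alpha_2)$ be pdfs belonging to a family of pdfs $G$ with
parameters $\alpha_1$ and $\alpha_2$ (which could be vectors). $G$ is said to be closed under multiplication if there
is another pdf $g(\theta \mid \alpha_3)$ in $G$ such that, for $\theta \in \Theta$,

$$g(\theta \mid \alpha_3) \propto g(\theta \mid \alpha_1)\, g(\theta \mid \alpha_2). \tag{4.3}$$

Example 4.6. Let $G$ be the family of beta distributions. Take any two members of $G$,
$g(\theta \mid \alpha_1)$ and $g(\theta \mid \alpha_2)$, as $\text{Beta}(a_1, b_1)$ and $\text{Beta}(a_2, b_2)$ distributions, respectively.
Then, for $\theta \in (0, 1)$,

$$g(\theta \mid a_1, b_1)\, g(\theta \mid a_2, b_2) \propto \theta^{a_1 + a_2 - 2}(1 - \theta)^{b_1 + b_2 - 2}
\propto g(\theta \mid a_1 + a_2 - 1,\ b_1 + b_2 - 1).$$

Thus, $g(\theta \mid \alpha_3) \propto g(\theta \mid \alpha_1)\, g(\theta \mid \alpha_2)$,
where $g(\theta \mid \alpha_3)$ is also a Beta distribution but with parameter $\alpha_3 = (a_1 + a_2 - 1,\ b_1 + b_2 - 1)$. Thus, the
family of beta distributions is closed under multiplication.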
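Closure under multiplication for the beta family can be verified numerically: the ratio of the product of two beta densities to the Beta$(a_1 + a_2 - 1,\ b_1 + b_2 - 1)$ density should not depend on $\theta$. The sketch below uses illustrative parameter values and a hand-written density; none of these names come from the text.

```python
import math


def beta_pdf(theta, a, b):
    """Beta(a, b) density at theta in (0, 1), computed on the log scale."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))


# Hypothetical hyperparameters for the two members of the family.
a1, b1 = 2.0, 3.0
a2, b2 = 4.0, 1.5

thetas = [0.1, 0.25, 0.5, 0.75, 0.9]
ratios = [
    beta_pdf(t, a1, b1) * beta_pdf(t, a2, b2) / beta_pdf(t, a1 + a2 - 1, b1 + b2 - 1)
    for t in thetas
]

# If (4.3) holds, every entry of `ratios` is the same constant of proportionality.
print(ratios)
```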
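Example 4.7. Suppose $G$ is the family of univariate normal distributions. Taking
$g(\theta \mid \mu_1, \sigma_1^2) = N(\mu_1, \sigma_1^2)$ and $g(\theta \mid \mu_2, \sigma_2^2) = N(\mu_2, \sigma_2^2)$, then

$$g(\theta \mid \mu_1, \sigma_1^2)\, g(\theta \mid \mu_2, \sigma_2^2)
\propto \exp\Big[-\frac{1}{2\sigma_1^2}(\theta - \mu_1)^2 - \frac{1}{2\sigma_2^2}(\theta - \mu_2)^2\Big]
\propto \exp\Big[-\frac{1}{2}\Big(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\Big)(\theta - \mu_3)^2\Big]
\propto g(\theta \mid \mu_3, \sigma_3^2),$$

where

$$\mu_3 = \Big(\frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2}\Big)\Big(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\Big)^{-1},
\qquad \sigma_3^2 = \Big(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\Big)^{-1}.$$

Thus, $g(\theta \mid \mu_3, \sigma_3^2)$ is the $N(\mu_3, \sigma_3^2)$ distribution.
Hence, the family of univariate normal distributions is closed under multiplication.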

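Example 4.8. Suppose the sample is drawn from the $N(\theta, r)$ distribution, where both $\theta$ and the precision $r$
are unknown. Consider the joint pdf of $\theta$ and $r$ belonging to the Normal-Gamma$(\mu, \tau, a, b)$ family of
distributions. Since

$$g(\theta, r \mid \mu, \tau, a, b) = g_1(\theta \mid r, \mu, \tau)\, g_2(r \mid a, b),$$

where $g_1(\theta \mid r, \mu, \tau)$ is $N(\mu, \tau r)$ and $g_2(r \mid a, b)$ is Gamma$(a, b)$, we have

$$g(\theta, r \mid \mu_1, \tau_1, a_1, b_1)\, g(\theta, r \mid \mu_2, \tau_2, a_2, b_2)
\propto \exp\Big[-\frac{r}{2}\big\{\tau_1(\theta - \mu_1)^2 + \tau_2(\theta - \mu_2)^2\big\}\Big]\exp\big[-r(b_1 + b_2)\big]\, r^{a_1 + a_2 - 1}$$

$$\propto r^{1/2}\exp\Big[-\frac{r}{2}(\tau_1 + \tau_2)\Big(\theta - \frac{\tau_1\mu_1 + \tau_2\mu_2}{\tau_1 + \tau_2}\Big)^2\Big]
\exp\Big[-r\Big\{\frac{1}{2}\,\frac{\tau_1\tau_2}{\tau_1 + \tau_2}(\mu_1 - \mu_2)^2 + (b_1 + b_2)\Big\}\Big]\, r^{a_1 + a_2 - \frac{1}{2} - 1}$$

$$\propto g_1\Big(\theta \,\Big|\, r,\ \frac{\tau_1\mu_1 + \tau_2\mu_2}{\tau_1 + \tau_2},\ \tau_1 + \tau_2\Big)\,
g_2\Big(r \,\Big|\, a_1 + a_2 - \frac{1}{2},\ \frac{1}{2}\,\frac{\tau_1\tau_2}{\tau_1 + \tau_2}(\mu_1 - \mu_2)^2 + (b_1 + b_2)\Big)$$

$$= g\Big(\theta, r \,\Big|\, \frac{\tau_1\mu_1 + \tau_2\mu_2}{\tau_1 + \tau_2},\ \tau_1 + \tau_2,\ a_1 + a_2 - \frac{1}{2},\
\frac{1}{2}\,\frac{\tau_1\tau_2}{\tau_1 + \tau_2}(\mu_1 - \mu_2)^2 + (b_1 + b_2)\Big).$$

Thus, the normal-gamma family of distributions is closed under multiplication.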


Closure Under Sampling 


Definition 4.3. A family $G$ of prior distributions for the parameter $\theta$ is said to be closed under sampling
from a population having distribution $f(x \mid \theta)$ if, for every prior distribution $g \in G$, the posterior
distribution $g(\theta \mid x) \propto g(\theta)\, f(x \mid \theta)$ is also in $G$.

Remark 4.3. Distributions closed under sampling were first introduced by G. Barnard (1954) and later 
Raiffa and Schlaifer (1961) called them conjugate prior distributions. 

Remark 4.4. Let $X_1, X_2, \ldots, X_n$ be a sequence of independently distributed rvs having a common
distribution $f(x \mid \theta)$ and let the prior $g \in G$; then the posterior distribution $g(\theta \mid x_1, \ldots, x_n)$ will
still belong to $G$.

Remark 4.5. The family $G$ having the closure property under sampling can be enlarged as follows. Define

$$h(\theta) = \frac{u(\theta)\, g(\theta)}{\int_\Theta u(\theta)\, g(\theta)\, d\theta}, \quad \theta \in \Theta, \tag{4.4}$$

for a given $g \in G$, where $u(\theta)$ is a given positive and bounded function of $\theta$. The family $H$ of $h(\theta)$
generated by $g \in G$ is also closed under sampling and so is $G \cup H$. However, it may not be possible
to obtain analytically the normalising constant $\int_\Theta u(\theta)\, g(\theta)\, d\theta$ for some chosen weight function
$u(\theta)$.

Remark 4.6. The prior distribution $g(\theta)$ may involve some unknown parameter(s). In order to
distinguish them from the parameters of the sampling distribution $f(x \mid \theta)$, we call the parameters of the
prior hyperparameters. These hyperparameters index the family of the prior distribution.

Remark 4.7. The notion of a family that is closed under sampling is useful in the case of sampling from
distributions belonging to exponential families.

Definition 4.4. A distribution is said to be natural conjugate to a given sampling process if its pdf
(or pmf) is proportional to the likelihood function corresponding to some observable sample from the
process.

In the light of Remark (4.3), the weighted natural conjugate distribution is sometimes called simply
conjugate.

Remark 4.8. A pdf or a pmf $g(\theta)$ is also called natural conjugate to the likelihood function
$\ell(\theta \mid n, x)$ if $g(\theta)$ and the likelihood function are proportional as functions of $\theta$.


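Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from a population having density
$f(x \mid \theta)$ with unknown parameter $\theta$. Since the likelihood function of $\theta$ is
$\ell(\theta \mid n, x) = \prod_{i=1}^{n} f(x_i \mid \theta)$ and since $g(\theta)$ must integrate to unity, we have

$$g(\theta) = \frac{\ell(\theta \mid n, x)}{\int_\Theta \ell(\theta \mid n, x)\, d\theta} = V(\theta \mid n, x), \ \text{(say)} \tag{4.5}$$

provided the denominator $\int_\Theta \ell(\theta \mid n, x)\, d\theta < \infty$.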


Remark 4.9. The parameterized family of natural conjugate distributions is obtained for a given 
sampling process by considering all probability densities, given by (4.5), for all sample sizes n and 


sample values x. 


Remark 4.10. Natural conjugate priors provide priors having mathematically tractable closed form 
expressions to perform Bayesian analysis. 


Remark 4.11. The parameters $(n', x')$ of the family are hyperparameters for the family.

Property 1. A conjugate family is closed under sampling since, for the sampled data $(n, x)$, the
posterior distribution is

$$g(\theta \mid n, x) = \frac{g(\theta)\, \ell(\theta \mid n, x)}{\int_\Theta g(\theta)\, \ell(\theta \mid n, x)\, d\theta}.$$

Let us write $g(\theta) = V(\theta \mid n', x')$ with hyperparameters $n'$ and $x'$. Then

$$g(\theta \mid n, x) = \frac{V(\theta \mid n', x')\, \ell(\theta \mid n, x)}{\int_\Theta V(\theta \mid n', x')\, \ell(\theta \mid n, x)\, d\theta}
= \frac{\ell(\theta \mid n'', x'')}{\int_\Theta \ell(\theta \mid n'', x'')\, d\theta} = V(\theta \mid n'', x'').$$


Remark 4.12. The values of the hyperparameters $(n', x')$ of the prior are replaced by the revised
posterior parameters $(n'', x'')$. In other words, $(n', x')$ is revised in the light of the observed sample $(n, x)$
such that $n'' = n + n'$ and $x'' = (x, x')$.

Remark 4.13. Prior uncertainty is based on a hypothetical prior sample or a virtual sample of size $n'$
and values $x' = (x'_1, x'_2, \ldots, x'_{n'})$. Hence, one could interpret a conjugate prior family as consisting of
posterior distributions, each coming from a suitable imaginary prior sample and the constant initial prior
density. Some authors call these constant pseudo-density models models representing complete
ignorance.

Remark 4.14. The natural conjugate prior requires an additional parameter compared with the sampling
model $f(x \mid \theta)$. In the multivariate case, the number of hyperparameters is very large and, therefore,
Bayesian inference concerning the parameters becomes quite difficult.

Remark 4.15. Any Bayesian analysis based on natural conjugate priors requires the knowledge of its 
hyperparameters. 

Remark 4.16. For the beta distribution with parameters $\alpha$ and $\theta$ (which belongs to the exponential family),
the natural conjugate prior, with hyperparameters $x_0$ and $\lambda$, takes the form

$$g(\theta \mid x_0, \lambda) \propto \Big(\frac{\Gamma(\alpha + \theta)}{\Gamma(\theta)}\Big)^{\lambda}(1 - x_0)^{\theta}, \quad \text{with } \alpha \text{ known},$$

which is analytically not convenient to use.

Example 4.9. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from $N(\theta, \sigma^2)$, $\sigma^2$ known. The likelihood
function of $\theta$ is

$$\ell(\theta \mid n, x) \propto \exp\Big[-\frac{n}{2\sigma^2}(\theta - \bar{x})^2\Big].$$

Thus the likelihood function is proportional to the $N\big(\bar{x}, \sigma^2/n\big)$ density. The conjugate family of prior
densities is

$$V(\theta \mid n', x') = g(\theta), \ \text{which is } N\Big(\bar{x}', \frac{\sigma^2}{n'}\Big),$$

and the posterior distribution is, therefore, $N\Big(\bar{x}'', \dfrac{\sigma^2}{n''}\Big)$, where

$$\bar{x}'' = \frac{x_1 + x_2 + \cdots + x_n + x'_1 + x'_2 + \cdots + x'_{n'}}{n + n'} = \frac{n\bar{x} + n'\bar{x}'}{n + n'}$$

and

$$n'' = n + n'.$$
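The update in Example 4.9 amounts to pooling the observed sample with the hypothetical prior sample. A minimal sketch follows, with made-up numbers and a hypothetical helper name `update_normal_mean`.

```python
def update_normal_mean(n_prior, xbar_prior, xs):
    """Return (n'', xbar'') for the N(xbar', sigma^2 / n') prior of Example 4.9.

    n_prior and xbar_prior are the hyperparameters n' and xbar'; xs is the observed sample.
    """
    n = len(xs)
    xbar = sum(xs) / n
    n_post = n + n_prior                                    # n'' = n + n'
    xbar_post = (n * xbar + n_prior * xbar_prior) / n_post   # pooled mean
    return n_post, xbar_post


sigma2 = 4.0                                  # known sampling variance (illustrative)
n_post, xbar_post = update_normal_mean(n_prior=2, xbar_prior=10.0,
                                       xs=[12.0, 11.0, 13.0, 12.0])
post_var = sigma2 / n_post                    # posterior variance sigma^2 / n''
print(n_post, xbar_post, post_var)
```

With these inputs the prior behaves like two earlier observations averaging 10, so the posterior mean is pulled from the sample mean 12 towards 10.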
Property 2. The posterior predictive density for a future independent sample $(m, y)$ following a
sample $(n, x)$ can be obtained by replacing the hyperparameters $(n', x')$ in the prior predictive density
by the posterior parameters $(n'', x'')$. It is easy to see that the prior predictive density of $X$ is

$$g(x) = \int_\Theta \ell(\theta \mid n, x)\, g(\theta)\, d\theta = \int_\Theta \ell(\theta \mid n, x)\, V(\theta \mid n', x')\, d\theta.$$


,» ofl 1 
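Example 4.10. (Example 4.9 continued) The prior predictive density $g(\bar{x})$ is
$N\Big(\bar{x}', \sigma^2\Big(\dfrac{1}{n} + \dfrac{1}{n'}\Big)\Big)$, since

$$E(\bar{X}) = E^{\theta} E^{\bar{X} \mid \theta}(\bar{X}) = E^{\theta}(\theta) = \bar{x}'$$

and

$$\mathrm{Var}(\bar{X}) = E^{\theta}\big(\mathrm{Var}^{\bar{X} \mid \theta}(\bar{X})\big) + \mathrm{Var}^{\theta}\big(E^{\bar{X} \mid \theta}(\bar{X})\big)
= E^{\theta}\Big(\frac{\sigma^2}{n}\Big) + \mathrm{Var}^{\theta}(\theta) = \frac{\sigma^2}{n} + \frac{\sigma^2}{n'}.$$

The posterior predictive density of a future sample $y$ of size $m$, given the sample $x$, is

$$g(y \mid x) = \int_\Theta \ell(\theta \mid m, y)\, V(\theta \mid n'', x'')\, d\theta.$$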


This property will be useful in Chapter 8 when predictive densities are used to draw inferences about 
future observations. 


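Example 4.11. (Example 4.10 continued) Let $y = (y_1, y_2, \ldots, y_m)$ be a future sample from $N(\theta, \sigma^2)$; then
the posterior predictive density of $\bar{y}$, using Property (2), has the form $N\big(\bar{x}'', \sigma^2(m^{-1} + n''^{-1})\big)$.

The posterior predictive mean and variance may be obtained as follows:

$$E(\bar{Y} \mid x) = E^{\theta \mid x} E^{\bar{Y} \mid \theta}(\bar{Y}) = E^{\theta \mid x}(\theta) = \bar{x}'',$$

and

$$\mathrm{Var}(\bar{Y} \mid x) = E^{\theta \mid x}\big[\mathrm{Var}^{\bar{Y} \mid \theta}(\bar{Y})\big] + \mathrm{Var}^{\theta \mid x}\big[E^{\bar{Y} \mid \theta}(\bar{Y})\big]
= E^{\theta \mid x}\Big(\frac{\sigma^2}{m}\Big) + \mathrm{Var}^{\theta \mid x}(\theta) = \frac{\sigma^2}{m} + \frac{\sigma^2}{n''}
= \sigma^2\big(m^{-1} + n''^{-1}\big).$$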
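The two-stage argument in Example 4.11 can be mimicked by simulation: draw $\theta$ from the posterior, then draw $\bar{Y}$ given $\theta$. The sketch below is only an illustration under assumed values of $\bar{x}''$, $n''$, $\sigma^2$ and $m$; the function name and the number of draws are arbitrary choices.

```python
import math
import random


def simulate_predictive_means(xbar_post, n_post, sigma2, m, draws=100_000, seed=0):
    """Simulate ybar from the posterior predictive by composition.

    Step 1: theta | x ~ N(xbar'', sigma^2 / n'').
    Step 2: ybar | theta ~ N(theta, sigma^2 / m).
    """
    rng = random.Random(seed)
    out = []
    for _ in range(draws):
        theta = rng.gauss(xbar_post, math.sqrt(sigma2 / n_post))
        out.append(rng.gauss(theta, math.sqrt(sigma2 / m)))
    return out


# Hypothetical posterior parameters and future sample size.
xbar_post, n_post, sigma2, m = 11.0, 6, 4.0, 5

sims = simulate_predictive_means(xbar_post, n_post, sigma2, m)
mc_mean = sum(sims) / len(sims)
mc_var = sum((s - mc_mean) ** 2 for s in sims) / (len(sims) - 1)
theory_var = sigma2 * (1.0 / m + 1.0 / n_post)

print(mc_mean, xbar_post)
print(mc_var, theory_var)
```

The simulated mean and variance should be close to $\bar{x}''$ and $\sigma^2(m^{-1} + n''^{-1})$, up to Monte Carlo error.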
Existence of Conjugate Priors 


Result 4.3. If the family $\mathcal{F} = \{f_n(\cdot \mid \theta),\ \theta \in \Theta\}$ of pdfs admits a sufficient statistic
$T(X_1, X_2, \ldots, X_n)$ of fixed dimension $k$ ($k \geq 1$) for every sample size $n$, then there exists a simple
conjugate family of prior distributions of $\theta$.

Proof. In order to prove the existence of such a family $\mathcal{G}$ of prior distributions of $\theta$, we must show that

(i) for any sample size $n$ and any observed values $x_1, x_2, \ldots, x_n$, the conditional joint pdf
$f(x_1, \ldots, x_n \mid \theta)$, regarded as a function of $\theta$, is proportional to one of the pdfs of the family, that
is, the family is closed under sampling, and

(ii) the family is closed under multiplication.

From Result 4.2, there is a function $v$ such that

$$f(x_1, x_2, \ldots, x_n \mid \theta) \propto v\big(T(x_1, x_2, \ldots, x_n), \theta\big). \tag{4.6}$$

Writing $T(x_1, x_2, \ldots, x_n) = t$, there exists a pdf $g(\theta \mid t, n)$ on the parameter space $\Theta$ such that

$$g(\theta \mid t, n) \propto v(t, \theta) \propto f(x_1, \ldots, x_n \mid \theta), \tag{4.7}$$

provided $\int_\Theta v(t, \theta)\, d\theta < \infty$.

Thus, if $f(x_1, \ldots, x_n \mid \theta)$, as a function of $\theta$, belongs to the family $\mathcal{G}$ of pdfs, then $g(\theta \mid t, n)$ also
belongs to $\mathcal{G}$. Hence $\mathcal{G}$ is closed under sampling.

In order to show that $\mathcal{G}$ is closed under multiplication, suppose $g(\theta \mid s, m)$ and $g(\theta \mid t, n)$ are
members of the family $\mathcal{G}$. There are two observed independent samples $x_1, x_2, \ldots, x_m$ and $y_1, y_2, \ldots, y_n$
of sizes $m$ and $n$ such that $T(x_1, \ldots, x_m) = s$ and $T(y_1, \ldots, y_n) = t$. If we denote the sufficient statistic
based on the combined sample of size $m + n$ by
$u = T(x_1, \ldots, x_m, y_1, \ldots, y_n)$,
we have

$$g(\theta \mid s, m)\, g(\theta \mid t, n) \propto f(x_1, \ldots, x_m \mid \theta)\, f(y_1, \ldots, y_n \mid \theta)
= f(x_1, \ldots, x_m, y_1, \ldots, y_n \mid \theta) \propto g(\theta \mid u, m + n) \in \mathcal{G}.$$
Remark 4.17. The Result 4.3 holds for the general family of pdfs. 


Thumb Rule for Constructing a Conjugate Prior 


Definition 4.5. Raiffa and Schlaifer (1961, page 31) define a kernel as follows.
Suppose the likelihood of $\theta$, given $x$, is $\ell(\theta \mid x)$ and $u$ and $v$ are functions such that, for all
$x$ and $\theta$, $\ell(\theta \mid x) = u(\theta \mid x)\, v(x)$, that is, the ratio $u(\theta \mid x) / \ell(\theta \mid x)$ is a constant as regards
$\theta$. The function $u(\theta \mid x)$ is called a kernel (not “the”) of the likelihood of $\theta$, given $x$, and $v(x)$ is
the residue of this likelihood.

For example, if $\ell(\theta \mid x) = e^{-\theta}\theta^{x} / x!$, then $e^{-\theta}\theta^{x}$ is a kernel of the likelihood of $\theta$
given $x$, and $(x!)^{-1}$ is the residue of this likelihood.

Definition 4.6. If the pdf of $\theta$ is $g$, where $g$ denotes either a prior or a posterior density, and if $k$ is another
function on $\Theta$ such that $g(\theta) = k(\theta) / \int k(\theta)\, d\theta$, that is, the ratio $k(\theta)/g(\theta)$ is a constant as
regards $\theta$, we shall write $g(\theta) \propto k(\theta)$ and say $k$ is a (not “the”) kernel of the density of $\theta$
(provided $\int k(\theta)\, d\theta$ is finite).

Suppose $t(x)$ is a sufficient statistic for the parameter $\theta$, so that the likelihood function satisfies

$$\ell(\theta \mid x) \propto k(t(x) \mid \theta), \quad x = (x_1, \ldots, x_n),$$

where $k(t(x) \mid \theta)$ is a kernel of the likelihood function. Replace all the terms in the kernel of the
likelihood function that are functions of the sample by prior hyperparameters, say $a = (a_1, a_2, \ldots, a_m)$.
Then, the conjugate prior is

$$g(\theta) \propto k(t(x) \mid \theta)\Big|_{t(x) = a}. \tag{4.8}$$

Example 4.12. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from the Gamma$(m, \theta)$ distribution with $m > 0$ known.
The likelihood function is

$$\ell(\theta \mid x) = \frac{1}{(\Gamma(m))^{n}\, \theta^{nm}} \prod_{i=1}^{n} x_i^{m-1} \exp\Big\{-\sum_{i=1}^{n} x_i / \theta\Big\}.$$

Since $t = \sum_{i=1}^{n} x_i$ is a sufficient statistic for $\theta$, a kernel of the likelihood function $k(t \mid \theta)$ is

$$k(t \mid \theta) = \frac{1}{\theta^{nm}} \exp\Big(-\frac{t}{\theta}\Big).$$

Therefore, the conjugate prior is

$$g(\theta) \propto \frac{e^{-a_2/\theta}}{\theta^{a_1}},$$

which is an Inverted-Gamma$(a_1 - 1, a_2)$ with hyperparameters $a_1$ and $a_2$.
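Under the thumb rule, the prior of Example 4.12 updates by $a_1 \to a_1 + nm$ and $a_2 \to a_2 + t$. The sketch below checks this numerically: prior kernel times likelihood, divided by the updated kernel, should be the same constant for every $\theta$. The data, hyperparameter values and function names are illustrative assumptions, and Gamma$(m, \theta)$ is taken with shape $m$ and scale $\theta$, as the likelihood above implies.

```python
import math


def log_prior_kernel(theta, a1, a2):
    """Log of the conjugate kernel theta**(-a1) * exp(-a2 / theta)."""
    return -a1 * math.log(theta) - a2 / theta


def log_likelihood(theta, xs, m):
    """Log-likelihood of theta for a Gamma sample with known shape m and scale theta."""
    n = len(xs)
    return (-n * math.lgamma(m) - n * m * math.log(theta)
            + (m - 1) * sum(math.log(x) for x in xs) - sum(xs) / theta)


def update(a1, a2, xs, m):
    """Posterior hyperparameters: a1 + n*m and a2 + sum of the observations."""
    return a1 + len(xs) * m, a2 + sum(xs)


xs, m = [1.2, 0.7, 2.5, 1.9], 2.0       # hypothetical data and known shape
a1, a2 = 3.0, 1.5                       # hypothetical prior hyperparameters
a1_post, a2_post = update(a1, a2, xs, m)

# Log of (prior kernel * likelihood / posterior kernel) at several theta values.
log_ratios = [
    log_prior_kernel(t, a1, a2) + log_likelihood(t, xs, m) - log_prior_kernel(t, a1_post, a2_post)
    for t in (0.3, 0.8, 1.5, 4.0)
]
print(a1_post, a2_post, log_ratios)
```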


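Example 4.13. If

$$f(x \mid \theta) = \frac{\theta k^{\theta}}{x^{\theta + 1}}, \quad x > k,\ k \text{ known},$$

$$= \theta\Big(\frac{k}{x}\Big)^{\theta}\frac{1}{x} = \theta \exp\Big(-\theta \log\frac{x}{k}\Big)\frac{1}{x}.$$

The likelihood function is

$$\ell(\theta \mid x) = \theta^{n} \exp\Big(-\theta\Big(\sum_{i=1}^{n} \log x_i - n \log k\Big)\Big)\prod_{i=1}^{n} \frac{1}{x_i}.$$

Since $t = \sum_{i=1}^{n} \log x_i$ is the sufficient statistic for $\theta$, a kernel $k(t \mid \theta)$ of the likelihood function is

$$k(t \mid \theta) = \theta^{n} e^{-\theta(t - n \log k)};$$

therefore, the conjugate prior is

$$g(\theta) \propto \theta^{a_1} e^{-a_2 \theta},$$

which is a Gamma$(a_1 + 1, a_2)$ with hyperparameters $a_1$ and $a_2$.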
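Example 4.14. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from the Pareto density

$$f(x \mid \theta) = \frac{\theta}{x^{2}}\, I_{(\theta, \infty)}(x), \quad \theta > 0.$$

The likelihood function of $\theta$ is

$$\ell(\theta \mid x) = \theta^{n}\Big(\prod_{i=1}^{n} x_i^{2}\Big)^{-1} I_{(0,\, x_{(1)})}(\theta).$$

Since $t = x_{(1)} = \min(x_1, \ldots, x_n)$ is the sufficient statistic for $\theta$, a kernel of the likelihood function is

$$k(t \mid \theta) = \theta^{n}\, I_{(0,\, t)}(\theta);$$

therefore, the conjugate prior is

$$g(\theta) \propto \theta^{a_1}\, I_{(0,\, a_2)}(\theta).$$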
Example 4.15. Suppose $X$ has a one-parameter exponential family of pdfs of the form

$$f(x \mid \theta) = v(\theta)\, h(x) \exp(x\theta).$$

Then a member of the conjugate family will have the structure

$$g(\theta \mid a, b) = k(a, b)\, (v(\theta))^{b} \exp(a\theta),$$

for some values $a$ and $b$ such that it is a proper density.
The posterior density will be

$$g(\theta \mid x) \propto (v(\theta))^{b+1} \exp((a + x)\theta).$$


4.4 CONJUGATE FAMILIES FOR SAMPLES FROM VARIOUS STANDARD DISTRIBUTIONS 


In this section we shall derive conjugate families of distributions for samples from some standard 
discrete and continuous distributions. 


Conjugate Families for Samples from a Normal Distribution 


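Result 4.4. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from $N(\theta, \sigma^2)$ with $\sigma^2$ known. The natural
conjugate prior pdf is $N(\mu, \sigma_0^2)$ and the posterior density of $\theta$, given $x$, is

$$N\Big(\frac{n\bar{x}\sigma_0^2 + \mu\sigma^2}{n\sigma_0^2 + \sigma^2},\ \frac{\sigma^2\sigma_0^2}{n\sigma_0^2 + \sigma^2}\Big).$$

Proof. The likelihood function of $\theta$, given $x$, is

$$\ell(\theta \mid x) \propto \exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \theta)^2\Big)
= \exp\Big[-\frac{1}{2\sigma^2}\Big\{n(\theta - \bar{x})^2 + \sum_{i=1}^{n}(x_i - \bar{x})^2\Big\}\Big]
\propto \exp\Big[-\frac{n}{2\sigma^2}(\theta - \bar{x})^2\Big].$$

The posterior density function of $\theta$, given $x$, is

$$g(\theta \mid x) \propto \ell(\theta \mid x)\, g(\theta \mid \mu, \sigma_0^2)
\propto \exp\Big[-\frac{1}{2}\Big\{\frac{n}{\sigma^2}(\theta - \bar{x})^2 + \frac{1}{\sigma_0^2}(\theta - \mu)^2\Big\}\Big].$$

On using the identity

$$A(z - a)^2 + B(z - b)^2 = (A + B)(z - c)^2 + \frac{AB}{A + B}(a - b)^2, \tag{4.9}$$

where $c = \dfrac{Aa + Bb}{A + B}$, we have

$$g(\theta \mid x) \propto \exp\Big[-\frac{1}{2}\Big\{\Big(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\Big)(\theta - c)^2 + \frac{n}{n\sigma_0^2 + \sigma^2}(\bar{x} - \mu)^2\Big\}\Big]
\propto \exp\Big[-\frac{1}{2}\,\frac{n\sigma_0^2 + \sigma^2}{\sigma^2\sigma_0^2}(\theta - c)^2\Big],$$

where

$$c = \frac{\dfrac{n\bar{x}}{\sigma^2} + \dfrac{\mu}{\sigma_0^2}}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_0^2}} = \frac{n\bar{x}\sigma_0^2 + \mu\sigma^2}{n\sigma_0^2 + \sigma^2},$$

which is a normal density with mean $c$ and variance $\dfrac{\sigma_0^2\sigma^2}{n\sigma_0^2 + \sigma^2}$.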
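Result 4.4 translates directly into a small function. The sketch below uses a hypothetical name `normal_posterior` and illustrative inputs; it also confirms that the posterior precision is the sum of the prior precision $1/\sigma_0^2$ and the data precision $n/\sigma^2$.

```python
def normal_posterior(xbar, n, sigma2, mu, sigma0_2):
    """Posterior mean and variance of theta under Result 4.4.

    xbar, n   : sample mean and sample size
    sigma2    : known sampling variance
    mu        : prior mean
    sigma0_2  : prior variance
    """
    denom = n * sigma0_2 + sigma2
    mean = (n * xbar * sigma0_2 + mu * sigma2) / denom
    var = sigma2 * sigma0_2 / denom
    return mean, var


# Illustrative inputs: sample mean 12 from n = 4, sigma^2 = 4, prior N(10, 2).
post_mean, post_var = normal_posterior(xbar=12.0, n=4, sigma2=4.0, mu=10.0, sigma0_2=2.0)
post_precision = 1.0 / post_var
sum_of_precisions = 1.0 / 2.0 + 4 / 4.0   # prior precision + data precision

print(post_mean, post_var)
print(post_precision, sum_of_precisions)
```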


Remark 4.18. Some authors prefer to use precision in place of variance. Let us assume that the sample
is drawn from $N(\theta, r)$, with precision $r\,(= 1/\sigma^2)$ known. If the conjugate prior of $\theta$ is taken as $N(\mu, \tau)$,
then the posterior distribution of $\theta$ is normal with mean

$$\mu(x) = \frac{\tau\mu + nr\bar{x}}{\tau + nr}$$

and precision $(\tau + nr)$.


Remark 4.19. We observe the following features of the normal posterior distribution.

(i) Posterior precision = prior precision + data precision.

(ii) $\mu(x) = \dfrac{nr}{\tau + nr}\,\bar{x} + \dfrac{\tau}{\tau + nr}\,\mu$. Thus the posterior mean is the weighted average of the prior mean
and the sample mean, with weights proportional to the respective precisions. One may say that the
posterior mean is a compromise between the prior mean and the sample mean.

(iii) The posterior mean can be expressed as the prior mean adjusted towards the observed sample
mean, that is,

$$\mu(x) = \mu + (\bar{x} - \mu)\,\frac{\sigma_0^2}{\dfrac{\sigma^2}{n} + \sigma_0^2},$$

whereas the sample mean is shrunk towards the prior mean, that is,

$$\mu(x) = \bar{x} - (\bar{x} - \mu)\,\frac{\sigma^2/n}{\dfrac{\sigma^2}{n} + \sigma_0^2}.$$

(iv) The posterior mean tends to the prior mean when the prior variance tends to zero. On the other hand, the
posterior mean tends to the sample mean if the variance of the sample mean tends to zero, that is,
for a fixed $\sigma^2$, $n \to \infty$.

It is interesting to note that making the prior variance equal to zero amounts to a degenerate prior
distribution concentrated at $\mu$ and, therefore, the prior and posterior distributions become identical. On the
other hand, as the sample size becomes large, the likelihood function dominates the prior distribution
and the posterior distribution becomes concentrated at the sample mean $\bar{x}$.

(v) The posterior variance of $\theta$, given $x$, is

$$\mathrm{Var}(\theta \mid x) = \frac{\sigma_0^2\,\sigma^2/n}{\sigma_0^2 + \sigma^2/n} = \frac{\sigma^2}{n}\cdot\frac{\sigma_0^2}{\sigma_0^2 + \sigma^2/n} < \mathrm{Var}(\bar{X}).$$

Furthermore, $\mathrm{Var}(\theta \mid x) < \mathrm{Var}(\theta)$, since

$$\mathrm{Var}(\theta \mid x) = \sigma_0^2 - \frac{\sigma_0^4}{\sigma_0^2 + \sigma^2/n}.$$
Remark 4.20. If we change the prior mean $\mu$ by an amount $\delta$ to make it $\mu + \delta$, then the posterior mean
changes by an amount $\dfrac{\delta\sigma^2}{n}\Big/\Big(\dfrac{\sigma^2}{n} + \sigma_0^2\Big)$. If $\sigma_0^2$ is sufficiently large relative to the sample
variance $\sigma^2/n$, then the posterior mean becomes insensitive to the change in the prior mean, since

$$\frac{\delta\sigma^2}{n}\Big/\Big(\frac{\sigma^2}{n} + \sigma_0^2\Big) \to 0 \quad \text{as } \frac{\sigma^2/n}{\sigma_0^2} \to 0.$$

Similarly, if we let $\mathrm{Var}(\bar{X}) = \sigma^2/n$ change by an amount $\delta$, then the posterior mean changes by
an amount

$$\frac{\delta(\mu - \bar{x})\sigma_0^2}{(\sigma_0^2 + \sigma^2/n)(\sigma_0^2 + \delta + \sigma^2/n)}.$$

The posterior mean will now be insensitive to the change if $|\mu - \bar{x}|$ is sufficiently small or $\sigma^2/n$ is
sufficiently large.

In other words, if prior information is weak, that is, $\sigma_0^2$ is large, posterior inference will be
insensitive to such a misspecification of the prior distribution. On the other hand, if the data information
is weak, that is, $\sigma^2/n$ is large, posterior inferences will be insensitive to a moderate amount of
misspecification of the sampling distribution.
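The first sensitivity formula in Remark 4.20 can be checked by recomputing the posterior mean directly. The sketch below uses hypothetical function names and illustrative numbers, and shows the shift shrinking as the prior variance $\sigma_0^2$ grows.

```python
def posterior_mean(xbar, n, sigma2, mu, sigma0_2):
    """Posterior mean of theta for N(theta, sigma2) data and an N(mu, sigma0_2) prior."""
    v = sigma2 / n                       # variance of the sample mean
    return (xbar * sigma0_2 + mu * v) / (sigma0_2 + v)


def shift_from_prior_mean(delta, n, sigma2, sigma0_2):
    """Change in the posterior mean when the prior mean moves from mu to mu + delta."""
    v = sigma2 / n
    return delta * v / (v + sigma0_2)


xbar, n, sigma2, mu, delta = 12.0, 4, 4.0, 10.0, 1.0   # illustrative values
shifts = {}
for sigma0_2 in (0.5, 2.0, 50.0):
    direct = (posterior_mean(xbar, n, sigma2, mu + delta, sigma0_2)
              - posterior_mean(xbar, n, sigma2, mu, sigma0_2))
    shifts[sigma0_2] = (direct, shift_from_prior_mean(delta, n, sigma2, sigma0_2))

for sigma0_2, (direct, formula) in shifts.items():
    print(sigma0_2, direct, formula)
```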


Remark 4.21. If $\sigma^2 / (n\sigma_0^2) \approx 0$ (that is, the prior variance is much larger than the sample variance) then,
as $\mu(x) \approx \bar{x}$ and the posterior variance is approximately $\sigma^2/n$, the posterior distribution of $\theta$ given $x$ tends
to $N(\bar{x}, \sigma^2/n)$. This may happen when the prior variance tends to $\infty$ for a fixed value of the sample
variance $\sigma^2/n$, that is, when data information dominates the prior information. It is interesting to note
that the limiting posterior distribution $N(\bar{x}, \sigma^2/n)$ cannot be obtained for any proper prior
distribution. In fact, the prior distribution $N(\mu, \sigma_0^2)$ tends to $N(\mu, \infty)$ as $\sigma_0^2 \to \infty$. The density

$$g(\theta \mid \mu, \sigma_0^2) = \frac{1}{\sigma_0\sqrt{2\pi}} \exp\Big[-\frac{1}{2\sigma_0^2}(\theta - \mu)^2\Big]$$

tends to zero for all $\theta$, and the fact that the limiting prior distribution $g(\theta) = 0$ does not integrate to unity
suggests that the limiting posterior distribution $N(\bar{x}, \sigma^2/n)$ cannot be obtained as the posterior
distribution for any proper prior distribution. In fact, one can get arbitrarily close to it with a normal
prior having large variance. Since $g(\theta) = 0$ for all $\theta$ as $\sigma_0^2 \to \infty$, the Bayes theorem gives
$g(\theta \mid x) = 0$ for all $\theta$. This is meaningless.

If we consider

$$g(\theta \mid x) \propto \exp\Big[-\frac{1}{2\sigma_0^2}(\theta - \mu)^2\Big]\exp\Big[-\frac{n}{2\sigma^2}(\theta - \bar{x})^2\Big]$$

by ignoring the constants of proportionality, then the first term on the right hand side tends to unity for
all $\theta$ as $\sigma_0^2 \to \infty$. Hence, in the limit,

$$g(\theta \mid x) \propto \exp\Big[-\frac{n}{2\sigma^2}(\theta - \bar{x})^2\Big],$$

which is a kernel of the $N(\bar{x}, \sigma^2/n)$ distribution. The direct application of Bayes theorem gives
$N(\bar{x}, \sigma^2/n)$ as a limiting posterior distribution as $\sigma_0^2 \to \infty$.
Here we have considered $g(\theta) \propto g^{*}(\theta) = 1$. Once again we observe that $g^{*}(\theta) = 1$ as the prior, which
has given the limiting posterior distribution $N(\bar{x}, \sigma^2/n)$, is also an improper distribution.


Result 4.5. Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from a normal distribution with known mean
$\theta$ and unknown precision $r$. If the prior for $r$ is Gamma$(\alpha, \beta)$, then the posterior distribution of $r$ is

$$\text{Gamma}\Big(\alpha + \frac{n}{2},\ \beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \theta)^2\Big).$$
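The update in Result 4.5 can be written as a two-line function and checked against the product of prior kernel and likelihood. In this sketch Gamma$(\alpha, \beta)$ is assumed to have kernel $r^{\alpha - 1}e^{-\beta r}$ (shape $\alpha$, rate $\beta$), which is the parameterization under which the stated update holds; the data, hyperparameters and function names are illustrative.

```python
import math


def update_precision(alpha, beta, xs, theta):
    """Posterior Gamma parameters for the precision r when the mean theta is known."""
    n = len(xs)
    ss = sum((x - theta) ** 2 for x in xs)
    return alpha + n / 2, beta + ss / 2


def log_gamma_kernel(r, alpha, beta):
    """Log of r**(alpha - 1) * exp(-beta * r), assuming the shape-rate form."""
    return (alpha - 1) * math.log(r) - beta * r


def log_likelihood(r, xs, theta):
    """Log-likelihood of r up to an additive constant: r**(n/2) * exp(-r * ss / 2)."""
    n = len(xs)
    ss = sum((x - theta) ** 2 for x in xs)
    return 0.5 * n * math.log(r) - 0.5 * r * ss


xs, theta = [0.5, -1.0, 1.5, 0.0], 0.0       # hypothetical data, known mean
alpha, beta = 2.0, 1.0                       # hypothetical prior
alpha_post, beta_post = update_precision(alpha, beta, xs, theta)

# Prior kernel times likelihood, relative to the updated kernel, for several r.
log_ratios = [
    log_gamma_kernel(r, alpha, beta) + log_likelihood(r, xs, theta)
    - log_gamma_kernel(r, alpha_post, beta_post)
    for r in (0.2, 1.0, 3.0)
]
print(alpha_post, beta_post, log_ratios)
```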


Result 4.6. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from a normal distribution with known
mean $\theta$ and unknown variance $\sigma^2$. The likelihood function of $\sigma^2$, given $x$, is

$$\ell(\sigma^2 \mid x) \propto \Big(\frac{1}{\sigma^2}\Big)^{n/2} \exp\Big[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \theta)^2\Big].$$

The conjugate prior density for $\sigma^2$ is the Inverted-Gamma$(\alpha, \beta)$ having the pdf

$$\frac{\beta^{\alpha}}{\Gamma(\alpha)}\Big(\frac{1}{\sigma^2}\Big)^{\alpha + 1} \exp\Big(-\frac{\beta}{\sigma^2}\Big), \quad \sigma^2 > 0,$$

and the resulting posterior density is Inverted-Gamma$\Big(\alpha + \dfrac{n}{2},\ \beta + \dfrac{1}{2}\sum_{i=1}^{n}(x_i - \theta)^2\Big)$.

Result 4.7. Suppose that $X = (X_1, X_2, \ldots, X_n)$ is a random sample from $N(\theta, r)$, both $\theta$ and precision
$r$ unknown. Let the joint prior distribution of $(\theta, r)$ be such that

$$g(\theta, r) = g(\theta \mid r)\, g(r), \tag{4.10}$$

where the conditional prior $g(\theta \mid r)$ for $\theta$, given $r$, is $N(\mu, \tau r)$, $\tau > 0$ a known constant, and $g(r)$ is the
marginal prior density of $r$ having a Gamma$(\alpha, \beta)$ distribution. The joint posterior distribution of $\theta$ and $r$ is such
that the conditional posterior distribution of $\theta$, given $r$, is $N(\mu^{*}, (\tau + n)r)$ and the marginal posterior
distribution of $r$ is Gamma$(\alpha + n/2, \beta^{*})$, where

$$\mu^{*} = \frac{\tau\mu + n\bar{x}}{\tau + n}, \qquad
\beta^{*} = \beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{\tau n}{2(\tau + n)}(\mu - \bar{x})^2.$$

Further, the marginal posterior density of $\theta$ is a 3-parameter t-density with $(2\alpha + n)$ df, location
parameter $\mu^{*}$, and scale parameter $(\tau + n)(2\alpha + n) / 2\beta^{*}$.

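Proof. The likelihood function of $\theta$ and $r$, given $x$, is

$$\ell(\theta, r \mid x) \propto r^{n/2} \exp\Big[-\frac{r}{2}\Big\{\sum_{i=1}^{n}(x_i - \bar{x})^2 + n(\theta - \bar{x})^2\Big\}\Big]$$

and

$$g(\theta, r) = g(\theta \mid r)\, g(r) \propto r^{\alpha + \frac{1}{2} - 1} \exp\Big[-r\Big\{\beta + \frac{\tau}{2}(\theta - \mu)^2\Big\}\Big].$$

On using the Bayes theorem, the joint posterior distribution of $\theta$ and $r$, given $x$, is

$$g(\theta, r \mid x) \propto g(\theta, r)\, \ell(\theta, r \mid x).$$

Since $\tau(\theta - \mu)^2 + n(\theta - \bar{x})^2 = (\tau + n)(\theta - \mu^{*})^2 + \dfrac{\tau n}{\tau + n}(\mu - \bar{x})^2$, we have

$$g(\theta, r \mid x) \propto r^{1/2} \exp\Big[-\frac{r}{2}(\tau + n)(\theta - \mu^{*})^2\Big]\, r^{\alpha + \frac{n}{2} - 1} \exp(-r\beta^{*}),$$

which is the product of a kernel of $N(\mu^{*}, (\tau + n)r)$ and that of Gamma$\big(\alpha + \frac{n}{2}, \beta^{*}\big)$.

In order to obtain the marginal posterior density of $\theta$, we observe that the marginal prior density
of $\theta$ is

$$g(\theta) = \int_0^{\infty} g(\theta, r)\, dr \propto \Big[\beta + \frac{\tau}{2}(\theta - \mu)^2\Big]^{-\left(\frac{2\alpha + 1}{2}\right)},$$

which is a kernel of a 3-parameter t-density with $2\alpha$ df, location parameter $\mu$, and scale parameter
$\alpha\tau/\beta$. Since for a conjugate family of distributions the form of the posterior distribution remains the same
but the hyperparameters are revised in the light of data, the marginal posterior distribution of $\theta$ becomes
a 3-parameter t-density with $(2\alpha + n)$ df, location parameter $\mu^{*}$, and scale parameter $(2\alpha + n)(\tau + n)/2\beta^{*}$.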
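The hyperparameter revision in Result 4.7 is easy to mechanise. The following sketch returns $(\mu^{*}, \tau + n, \alpha + n/2, \beta^{*})$ for given data; the function name and the numerical inputs are illustrative assumptions, and Gamma$(\alpha, \beta)$ is again read in its shape-rate form.

```python
def normal_gamma_update(mu, tau, alpha, beta, xs):
    """Posterior normal-gamma hyperparameters (mu*, tau + n, alpha + n/2, beta*)."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)          # sum of squares about the sample mean
    mu_star = (tau * mu + n * xbar) / (tau + n)
    beta_star = beta + 0.5 * ss + tau * n * (mu - xbar) ** 2 / (2.0 * (tau + n))
    return mu_star, tau + n, alpha + n / 2.0, beta_star


# Hypothetical prior N-Gamma(mu=0, tau=1, alpha=2, beta=1) and three observations.
mu_star, tau_star, alpha_star, beta_star = normal_gamma_update(0.0, 1.0, 2.0, 1.0, [1.0, 2.0, 3.0])

# Precision-type scale parameter of the marginal posterior t-density of theta.
t_df = 2.0 * alpha_star
t_scale = tau_star * alpha_star / beta_star

print(mu_star, tau_star, alpha_star, beta_star)
print(t_df, t_scale)
```

Here `t_df` equals $2\alpha + n$ and `t_scale` equals $(\tau + n)(2\alpha + n)/2\beta^{*}$, matching the last statement of the result.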
Remark 4.22. Suppose we assume that $\theta$ and $r$ are both, a-priori, independent having a non-informative prior density $g(\theta, r) \propto 1/r$. Then the posterior density of $(\theta, r)$ is

$$g(\theta, r \mid x) \propto r^{n/2 - 1} \exp\left[-r\left\{\sum (x_i - \bar{x})^2 + n(\theta - \bar{x})^2\right\}/2\right].$$

Thus,

$$g(r \mid x) \propto r^{(n-1)/2 - 1} \exp\left[-r \sum (x_i - \bar{x})^2 / 2\right],$$

which is Gamma$\left(\dfrac{n-1}{2}, \dfrac{\sum (x_i - \bar{x})^2}{2}\right)$, and the conditional posterior density of $\theta$, given $r$, is

$$g(\theta \mid r, x) \propto \exp\left[-nr(\theta - \bar{x})^2/2\right],$$

which is $N(\bar{x}, nr)$.

Note that the conditional posterior density of $\theta$, given $r$, is the same as the unconditional posterior density of $\theta$ when $r$ is known and $g(\theta) \propto 1$. We also note that even though $\theta$ and $r$ are a-priori independent, the posteriors are not so. This suggests that the choice of the joint prior of $\theta$ and $r$, as taken in the Result (4.7), becomes meaningful when our prior is based on past observed data.


Remark 4.23. Under the non-informative prior $g(\theta, r) \propto 1/r$, the posterior distribution of $\dfrac{\theta - \bar{x}}{s/\sqrt{n}}$ is a t-distribution with $(n-1)$ df. On the other hand, the sampling distribution of $\dfrac{\bar{X} - \theta}{s/\sqrt{n}}$, for given $\theta$ and $\sigma^2$, is also a t-distribution with $(n-1)$ df, where $s^2 = \sum (x_i - \bar{x})^2/(n-1)$. Thus, the inferences based on Classical and Bayesian approaches may lead to similar conclusions. For example, the classical confidence intervals and HPD credible intervals for the unknown parameter $\theta$ will come out to be numerically the same (once the sample is obtained).

Remark 4.24. If we decide to choose prior distributions for $\theta$ and $r$ to be independent normal and gamma distributions, respectively, then $\theta$ and $r$ are not independent a-posteriori, and their joint posterior distribution does not follow any standard parametric form.

Remark 4.25. The joint prior distribution of 8 and r considered in the Result (4.7) is known as Normal- 
gamma density. 

Remark 4.26. We may consider the joint prior distribution of $\theta$ and $\sigma^2$, $g(\theta, \sigma^2) = g(\theta \mid \sigma^2)\, g(\sigma^2)$, where the conditional prior of $\theta$, given $\sigma^2$, is $N(\theta_0, \sigma^2/n_0)$ and the marginal prior distribution of $\sigma^2$ is Inverted-Gamma$\left(\dfrac{\nu}{2}, \dfrac{s_0^2}{2}\right)$; then

$$g(\theta, \sigma^2 \mid x) \propto \sigma^{-n-\nu-3} \exp\left[-\frac{1}{2\sigma^2}\left\{s_1^2 + n_1(\theta - \theta_1)^2\right\}\right],$$

where

$$n_1 = n + n_0, \qquad \theta_1 = \frac{1}{n_1}(n_0\theta_0 + n\bar{x}), \qquad s_1^2 = s^2 + s_0^2 + \left(\frac{1}{n_0} + \frac{1}{n}\right)^{-1}(\theta_0 - \bar{x})^2,$$

and, in this remark, $s^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2$. It is easy to see that the conditional posterior distribution of $\theta$, given $\sigma^2$, is

$$g(\theta \mid \sigma^2, x) \propto \frac{1}{\sigma}\exp\left[-\frac{n_1(\theta - \theta_1)^2}{2\sigma^2}\right]$$

and the marginal posterior distribution of $\sigma^2$ is

$$g(\sigma^2 \mid x) \propto \sigma^{-n-\nu-2}\exp(-s_1^2/2\sigma^2).$$

The marginal posterior distribution of $\theta$, as before, is Student's t-distribution. We should note that, except when the prior is built from previous (or virtual) observations, $n_0$ is not a sample size. However, $n_0/n$ characterizes the relative precision of the determination of the prior distribution as compared with the precision of the observations. If $n_0/n \to 0$, we get the limiting conditional posterior distribution of $\theta$, given $\sigma^2$, as $N(\bar{x}, \sigma^2/n)$, which is the posterior distribution associated with the Jeffreys' non-informative prior $g(\theta) = 1$.


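Remark 4.27. In particular, if we let $\alpha \to -\dfrac{1}{2}$, $\beta \to 0$, $\tau \to 0$, then the marginal posterior density of $\sigma^2$ reduces to Inverted-Gamma$\left(\dfrac{n-1}{2}, \dfrac{\sum (x_i - \bar{x})^2}{2}\right)$, that is, $r = 1/\sigma^2$ is Gamma$\left(\dfrac{n-1}{2}, \dfrac{\sum (x_i - \bar{x})^2}{2}\right)$; the conditional posterior density of $\theta$, given $\sigma^2$, becomes $N(\bar{x}, \sigma^2/n)$; and the marginal posterior density of $\theta$ becomes a t-density with parameters $(n-1)$, $\bar{x}$, $\sum_{i=1}^{n}(x_i - \bar{x})^2/n(n-1)$.
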
Conjugate Families for Other Standard Distributions 


Result 4.8. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from Bin$(k, \theta)$, with $k$ known. The conjugate prior density for $\theta$ is Beta$(\alpha, \beta)$ and the posterior density of $\theta$, given $x$, is

$$\text{Beta}\left(\alpha + \sum_{i=1}^{n} x_i,\; \beta + nk - \sum_{i=1}^{n} x_i\right).$$

Remark 4.28. In particular, for a sample of size one, the posterior distribution of $\theta$, given $x$, is

$$g(\theta \mid x) = \frac{\theta^{\alpha + x - 1}(1 - \theta)^{\beta + k - x - 1}}{B(\alpha + x, \beta + k - x)}, \qquad \theta \in [0, 1].$$

The maximum likelihood estimate of $\theta$ is $x/k$, the mode of the prior distribution is $(\alpha - 1)/(\alpha + \beta - 2)$ for $\alpha, \beta > 1$, and the mode of the posterior distribution is at $(x + \alpha - 1)/(k + \alpha + \beta - 2)$, which is equal to

$$a\,\frac{x}{k} + (1 - a)\,\frac{\alpha - 1}{\alpha + \beta - 2}, \qquad \text{for } a = \frac{k}{k + \alpha + \beta - 2},\; 0 < a < 1.$$


The posterior distribution of @ synthesizes and compromises by favouring values between the maxima 
of prior density and likelihood function. 
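The posterior mean may be expressed as

$$E(\theta \mid x) = \frac{k\,\dfrac{x}{k} + (\alpha + \beta)\,\dfrac{\alpha}{\alpha + \beta}}{\alpha + \beta + k} = \frac{k}{\alpha + \beta + k}\left(\frac{x}{k}\right) + \frac{\alpha + \beta}{\alpha + \beta + k}\left(\frac{\alpha}{\alpha + \beta}\right).$$

Since $x/k$ is the sample proportion and $\alpha/(\alpha + \beta)$ is the prior mean, the posterior mean is a linear combination of the prior mean and the sample proportion.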


Result 4.9. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from Pois$(\theta)$. Then the natural conjugate prior for $\theta$ is Gamma$(\alpha, \beta)$ and the posterior density of $\theta$, given $x$, is Gamma$(\alpha + \sum x_i, \beta + n)$.


Remark 4.29. It is interesting to see that the sample information appearing in the posterior distribution 
is in terms of sufficient statistic for the parameter under question. We further note that all of the above 
sampling distributions belong to the univariate one-parameter exponential family of distributions. 
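
As a minimal sketch of Results 4.8 and 4.9, the conjugate updates only need the sufficient statistic $\sum x_i$ and the sample size. The function names and the numbers below are hypothetical; the Gamma prior is taken in the shape and rate form implied by the posterior Gamma$(\alpha + \sum x_i, \beta + n)$.

```python
def beta_binomial_update(alpha, beta, xs, k):
    """Result 4.8: Beta(alpha, beta) prior, xs a sample from Bin(k, theta)."""
    n, s = len(xs), sum(xs)
    return alpha + s, beta + n * k - s

def gamma_poisson_update(alpha, beta, xs):
    """Result 4.9: Gamma(alpha, beta) prior (beta is a rate), xs a sample from Pois(theta)."""
    return alpha + sum(xs), beta + len(xs)

# Hypothetical numbers for illustration only.
a1, b1 = beta_binomial_update(2.0, 3.0, xs=[4, 6, 5], k=10)
post_mean = a1 / (a1 + b1)

# The same posterior mean written as a weighted average of the sample
# proportion (15 successes in 30 trials) and the prior mean 2/5.
w = 30 / (2.0 + 3.0 + 30)
weighted = w * (15 / 30) + (1 - w) * (2.0 / 5.0)

g_shape, g_rate = gamma_poisson_update(1.5, 2.0, xs=[3, 0, 2, 4])
```

The agreement of `post_mean` and `weighted` is the linear-combination property noted in Remark 4.28, here with $nk = 30$ trials in total.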


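Result 4.10. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from $U(0, \theta)$. The likelihood function of $\theta$, given $x$, is

$$\ell(\theta \mid x) = \theta^{-n}, \qquad \text{for } \theta > x_i;\; i = 1, 2, \ldots, n.$$

If we denote $M = \max(x_1, x_2, \ldots, x_n)$, then

$$\ell(\theta \mid x) = \theta^{-n} I_{(M, \infty)}(\theta),$$

where $I_A(\theta)$ is an indicator function defined as

$$I_A(\theta) = \begin{cases} 1 & \text{if } \theta \in A \\ 0 & \text{otherwise.} \end{cases}$$

The natural conjugate prior for $\theta$ is

$$g(\theta) \propto \theta^{-(\alpha + 1)} I_{(\beta, \infty)}(\theta),$$

which is a Pareto$(\alpha, \beta)$ distribution. The posterior density of $\theta$, given $x$, is

$$g(\theta \mid x) \propto \theta^{-(\alpha + n + 1)} I_{(\beta, \infty)}(\theta)\, I_{(M, \infty)}(\theta) = \theta^{-(\alpha_1 + 1)} I_{(\beta_1, \infty)}(\theta),$$

where $\alpha_1 = \alpha + n$ and $\beta_1 = \max(\beta, M)$, which is a Pareto$(\alpha_1, \beta_1)$ distribution.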
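
The Pareto update of Result 4.10 can be sketched as follows. The function names and data are hypothetical; the posterior mean uses the standard fact that a Pareto$(\alpha, \beta)$ density proportional to $\theta^{-(\alpha+1)}$ on $(\beta, \infty)$ has mean $\alpha\beta/(\alpha - 1)$ for $\alpha > 1$.

```python
def pareto_uniform_update(alpha, beta, xs):
    """Result 4.10: Pareto(alpha, beta) prior, xs a sample from U(0, theta)."""
    return alpha + len(xs), max(beta, max(xs))

def pareto_mean(alpha, beta):
    """Mean of Pareto(alpha, beta); finite only when alpha > 1."""
    if alpha <= 1:
        raise ValueError("mean is infinite for alpha <= 1")
    return alpha * beta / (alpha - 1)

# Hypothetical prior and data: the largest observation exceeds the prior bound.
alpha1, beta1 = pareto_uniform_update(2.0, 1.0, xs=[0.5, 2.3, 1.1])
theta_mean = pareto_mean(alpha1, beta1)
```

Only the largest order statistic and the sample size enter the update, in line with the sufficiency remark that follows.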


Remark 4.30. The sampling distribution of $x$, which is $U(0, \theta)$, is not a member of the regular exponential family. However, the parameter $\theta$ admits a sufficient statistic, namely the largest order statistic. Thus, we may say that the natural conjugate prior distribution for the parameter $\theta$ may exist even if the sampling distribution $f(x \mid \theta)$ does not belong to the regular exponential family of distributions. Result 4.3 suggests that the existence of a sufficient statistic for the parameter may help in constructing the natural conjugate prior for the parameter $\theta$.


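Result 4.11. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from the Pareto density $f(x \mid \theta) = \dfrac{\theta k^{\theta}}{x^{\theta + 1}}$, $\infty > x > k$, $k$ known. Then the natural conjugate prior for $\theta$ is Gamma$(\alpha_1 + 1, \alpha_2)$ and the posterior density of $\theta$, given $x$, is also

$$\text{Gamma}\left(\alpha_1 + n + 1,\; \alpha_2 + \sum_{i=1}^{n} \log(x_i/k)\right).$$

Result 4.12. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from the Pareto density

$$f(x \mid \theta) = \frac{\theta}{x^2}\, I_{(\theta, \infty)}(x), \qquad \theta > 0.$$

Then the natural conjugate prior for $\theta$ is

$$g(\theta) = \frac{\beta}{m^{\beta}}\, \theta^{\beta - 1} I_{(0, m)}(\theta),$$

and the posterior density of $\theta$, given $x$, is

$$g(\theta \mid x) = \frac{n + \beta}{m_1^{\,n + \beta}}\, \theta^{n + \beta - 1} I_{(0, m_1)}(\theta); \qquad m_1 = \min(x_{(1)}, m), \quad x_{(1)} = \min(x_1, x_2, \ldots, x_n).$$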

Example 4.16. Suppose a lot containing 1000 items is received from a supplier containing $\theta$ (unknown) defective items. The past experiences with this supplier suggest that 5% of items in a lot are defective. Suppose we are told that each item he produces has probability 0.05 of being defective, and defectives occur independently. The natural prior to use for $\theta$ is Bin(1000, 0.05). Suppose we select a random sample of 10 items from this lot, and let $X$ be the number of defectives in the sample. Then the distribution of $X$, given $\theta$, is hypergeometric having pmf

$$f(x \mid \theta) = \binom{\theta}{x}\binom{1000 - \theta}{10 - x}\bigg/\binom{1000}{10}; \qquad x = 0, 1, 2, \ldots, 10.$$

The joint distribution of $X$ and $\theta$ is

$$f(x, \theta) = f(x \mid \theta)\, g(\theta) = \frac{\binom{\theta}{x}\binom{1000 - \theta}{10 - x}}{\binom{1000}{10}}\binom{1000}{\theta}(0.05)^{\theta}(1 - 0.05)^{1000 - \theta} = \binom{10}{x}\binom{990}{\theta - x}(0.05)^{\theta}(0.95)^{1000 - \theta},$$

$x = 0, 1, 2, \ldots, 10$ and $\theta = x, x + 1, \ldots, x + 990$. Note that there are $x$ defective and $(10 - x)$ non-defective items in the sample. The smallest possible value for $\theta$ is $x$ and the largest is $1000 - (10 - x) = 990 + x$. The marginal pmf of $x$ is obtained by summing over the range of $\theta$. Then

$$m(x) = \sum_{\theta = x}^{x + 990}\binom{10}{x}\binom{990}{\theta - x}(0.05)^{\theta}(0.95)^{1000 - \theta}$$

$$= \binom{10}{x}(0.05)^{x}(0.95)^{10 - x}\sum_{\theta - x = 0}^{990}\binom{990}{\theta - x}(0.05)^{\theta - x}(0.95)^{990 - (\theta - x)}$$

$$= \binom{10}{x}(0.05)^{x}(0.95)^{10 - x}; \qquad x = 0, 1, \ldots, 10,$$

which is a Bin(10, 0.05) distribution. The posterior pmf for $\theta$ is

$$g(\theta \mid x) = \frac{f(x, \theta)}{m(x)} = \binom{990}{\theta - x}(0.05)^{\theta - x}(0.95)^{990 - (\theta - x)},$$

$\theta = x, x + 1, \ldots, x + 990$; which is a Bin(990, 0.05) distribution having a range from $x$ to $990 + x$.

Remark 4.31. Note that the marginal pmf of x and the posterior pmf of 8 both work out to be binomial 
distribution. 
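
The algebra of Example 4.16 can be checked by brute force with exact rational arithmetic. The sketch below is not part of the original example; the helper names `prior`, `likelihood` and `posterior` are hypothetical, and the observed value `x = 2` is chosen arbitrarily.

```python
from fractions import Fraction
from math import comb

N, n, p = 1000, 10, Fraction(1, 20)    # lot size, sample size, defect probability 0.05

def prior(theta):
    """Bin(1000, 0.05) prior pmf of the number of defectives in the lot."""
    return comb(N, theta) * p**theta * (1 - p)**(N - theta)

def likelihood(x, theta):
    """Hypergeometric pmf of x defectives in the sample, given theta."""
    return Fraction(comb(theta, x) * comb(N - theta, n - x), comb(N, n))

def posterior(x):
    """Return the marginal m(x) and the exact posterior pmf of theta."""
    joint = {t: likelihood(x, t) * prior(t) for t in range(x, N - n + x + 1)}
    m = sum(joint.values())
    return m, {t: j / m for t, j in joint.items()}

x = 2                                   # hypothetical observed number of defectives
m, post = posterior(x)

# Closed forms derived in the example.
m_closed = comb(10, x) * p**x * (1 - p)**(10 - x)
post_closed = lambda t: comb(990, t - x) * p**(t - x) * (1 - p)**(990 - (t - x))
```

Because `Fraction` arithmetic is exact, `m == m_closed` and `post[t] == post_closed(t)` hold as equalities rather than approximations.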

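Result 4.13. Suppose $g_1$ and $g_2$ are two prior densities of $\theta$ belonging to the family of the natural conjugate priors for $f(x \mid \theta)$. If $g(\theta) = \alpha g_1(\theta) + (1 - \alpha) g_2(\theta)$; $0 < \alpha < 1$, then

$$g(\theta \mid x) = w\, g_1(\theta \mid x) + (1 - w)\, g_2(\theta \mid x),$$

where

$$m_i(x) = \int f(x \mid \theta)\, g_i(\theta)\, d\theta \quad \text{and} \quad g_i(\theta \mid x) = \frac{f(x \mid \theta)\, g_i(\theta)}{m_i(x)}, \qquad i = 1, 2,$$

and

$$w = \frac{\alpha\, m_1(x)}{\alpha\, m_1(x) + (1 - \alpha)\, m_2(x)}.$$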

Example 4.17. (Diaconis and Ylvisaker (1985)) When a coin is spun on its edge instead of being tossed in the air, the proportion of heads is not close to half, but is rather 1/3 or 2/3 because of irregularities in the edge. When spinning a given coin $n$ times on its edge, we observe the number of heads $X \sim$ Bin$(n, \theta)$. The prior distribution on $\theta$ is then likely to be bimodal with modes at 1/3 and 2/3, which cannot be modelled through a conjugate prior $g_1 \sim$ Beta$(\alpha, \beta)$. A mixture prior distribution $g_2$ such as (Beta(10, 20) + Beta(20, 10))/2 is more appropriate. Note that we are taking Beta(10, 20) as the first prior since its mean is 1/3 and the other prior as Beta(20, 10) since its mean is 2/3, both having a common variance of 2/279, which is relatively small. It can be seen that the posterior distribution will be a mixture of two Beta distributions.
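
A minimal sketch of how Result 4.13 applies to the coin-spinning mixture prior is given below. The function names and the data (3 heads in 10 spins) are hypothetical. Each component's marginal likelihood is $m_i(x) = \binom{n}{x} B(a_i + x, b_i + n - x)/B(a_i, b_i)$, and the common binomial coefficient cancels in the posterior weights.

```python
from math import lgamma, exp, log

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def mixture_posterior(components, x, n):
    """Posterior of a mixture of Beta priors after x successes in n Bernoulli trials.

    components: list of (weight, a, b) triples, one per Beta(a, b) component.
    Returns a list of (posterior weight, a + x, b + n - x) triples.
    """
    # log of weight_i * m_i(x), up to the common factor C(n, x)
    log_wm = [log(w) + log_beta(a + x, b + n - x) - log_beta(a, b)
              for w, a, b in components]
    top = max(log_wm)                        # subtract the maximum for numerical stability
    unnorm = [exp(v - top) for v in log_wm]
    total = sum(unnorm)
    return [(u / total, a + x, b + n - x)
            for u, (_, a, b) in zip(unnorm, components)]

prior = [(0.5, 10, 20), (0.5, 20, 10)]       # (Beta(10, 20) + Beta(20, 10)) / 2
post = mixture_posterior(prior, x=3, n=10)   # hypothetical data: 3 heads in 10 spins
```

With few heads the weight shifts towards the Beta(10, 20) component, whose mean is 1/3, while the posterior stays a two-component Beta mixture.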
Remark 4.32. Dalal and Hall (1983) have demonstrated that any prior density for an exponential family parameter can be approximated by a mixture of conjugate prior distributions. In practice, it may become quite difficult and intractable to specify the conjugate densities and their number so that the resulting posterior is near the limiting posterior based on an actual single prior.


4.5 EQUIVALENT PRIOR SAMPLE SIZE 


Let us consider the posterior distribution for the probability of success $\theta$ when $n$ Bernoulli trials are performed. If we take the prior distribution for $\theta$ as Beta$(\alpha, \beta)$, the posterior mean of $\theta$ is

$$E(\theta \mid x) = \frac{\alpha + x}{\alpha + \beta + n} = \frac{n(x/n) + (\alpha + \beta)\left(\alpha/(\alpha + \beta)\right)}{\alpha + \beta + n} \tag{4.11}$$

since the posterior distribution $g(\theta \mid x)$ is a Beta$(\alpha + x, \beta + n - x)$. It is a weighted average of the sample proportion $x/n$ and the prior mean $\alpha/(\alpha + \beta)$. The weights depend on the sample size $n$ and $\alpha + \beta$ in such a way that $E(\theta \mid x) \to x/n$ as $n \to \infty$ for a fixed value of $\alpha + \beta$, whereas $E(\theta \mid x) \to \alpha/(\alpha + \beta)$ as $\alpha + \beta \to \infty$ for a fixed value of $n$. If we write the prior variance of $\theta$ as $\dfrac{\alpha}{\alpha + \beta}\left(\dfrac{\beta}{\alpha + \beta}\right)\bigg/(\alpha + \beta + 1)$, then the representation of $E(\theta \mid x)$ given in (4.11) suggests that $(\alpha + \beta)$ may be interpreted as the prior sample size. We may also interpret it as “$E(\theta \mid x)$ is the maximum likelihood estimate of $\theta$ for data obtained by supplementing the real data ($x$ successes out of $n$ trials) by “fictitious data” consisting of $\alpha$ successes in $\alpha + \beta$ trials. Thus we may consider that $(\alpha + \beta)$ is playing the role of prior sample size.”

Remark 4.33. The interpretation of the prior information as equivalent to a sample of size $(\alpha + \beta)$ yielding $\alpha$ successes seems to support Haldane's proposal for an expression of prior ignorance. Letting $\alpha$ and $\beta$ tend to zero, that is, the equivalent prior sample size becomes zero, should amount to choosing the prior distribution of $\theta$ proportional to $1/\theta(1 - \theta)$, which corresponds to a Beta(0, 0) distribution.

However, if we consider a random sample $X_1, \ldots, X_n$ from $N(\theta, \sigma^2)$ for $\sigma^2$ known and $N(\mu_0, \sigma_0^2)$ as the prior distribution of $\theta$, then the posterior distribution of $\theta$, given $x$, is normal with mean

$$\mu_1 = \left(\frac{n\bar{x}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)^{-1} \quad \text{and variance} \quad \sigma_1^2 = \left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)^{-1}.$$

If we denote $\sigma_0^2 = \sigma^2/n_0$, then the new set of hyperparameters becomes $(n_0, \mu_0)$ and the posterior parameters $(\mu_1, \sigma_1^2)$ are transformed into $\left(\dfrac{n\bar{x} + n_0\mu_0}{n + n_0}, \dfrac{\sigma^2}{n + n_0}\right)$. In other words, the hyperparameters $(n_0, \mu_0)$ are changed to $\left(n + n_0, \dfrac{n\bar{x} + n_0\mu_0}{n + n_0}\right)$. We may consider that the prior information is equivalent to a sample of size $n_0 = \sigma^2/\sigma_0^2$ from a $N(\theta, \sigma^2)$ distribution yielding sample mean $\mu_0$. If we combine this ‘equivalent sample’ with the actual sample, we have a composite sample of size $n_0 + n$ with sample mean $\dfrac{n_0\mu_0 + n\bar{x}}{n_0 + n}$. This suggests that if we start from prior ignorance, represented by hyperparameters (0, 0) which is an improper uniform distribution, the ‘equivalent sample’ will produce a posterior distribution $N(\mu_0, \sigma_0^2)$ and the composite sample produces a posterior distribution $N\left(\dfrac{n_0\mu_0 + n\bar{x}}{n_0 + n}, \dfrac{\sigma^2}{n + n_0}\right)$.


Remark 4.34. In the above example, the parameter $n_0 = \sigma^2/\sigma_0^2$, which plays the role of a sample size, may not be an integer.
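
The equivalent-sample reading of the normal case can be sketched numerically. The function name and the numbers are hypothetical; the point of the sketch is that the composite-sample formulas agree with the precision-weighted form of the posterior, and that $n_0$ need not be an integer.

```python
def normal_equivalent_sample_update(xbar, n, sigma2, mu0, sigma0_2):
    """Posterior of a normal mean with known variance sigma2 and N(mu0, sigma0_2) prior."""
    n0 = sigma2 / sigma0_2                      # equivalent prior sample size
    mu1 = (n * xbar + n0 * mu0) / (n + n0)      # mean of the composite sample
    var1 = sigma2 / (n + n0)                    # posterior variance
    return n0, mu1, var1

# Hypothetical values: 20 observations with mean 10, known variance 4.
n0, mu1, var1 = normal_equivalent_sample_update(xbar=10.0, n=20, sigma2=4.0,
                                                mu0=8.0, sigma0_2=1.6)

# Same posterior from the precision-weighted formulas.
prec = 20 / 4.0 + 1 / 1.6
mu1_check = (20 * 10.0 / 4.0 + 8.0 / 1.6) / prec
var1_check = 1 / prec
```

Here `n0` comes out as 2.5, so the prior carries the weight of two and a half observations.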


4.6 EXPONENTIAL FAMILY OF DISTRIBUTIONS 


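Result 4.14. For the regular one-parameter exponential family of distributions

$$f(x \mid \theta) = v(\theta)\, u(x) \exp\left(c\,\phi(\theta)\, h(x)\right),$$

the family of conjugate priors is given by

$$g(\theta \mid \tau) = \left(k(\tau_1, \tau_2)\right)^{-1} \left(v(\theta)\right)^{\tau_1} \exp\left(c\,\tau_2\,\phi(\theta)\right),$$

where $\tau = (\tau_1, \tau_2)$ are the hyperparameters such that

$$k(\tau_1, \tau_2) = \int \left(v(\theta)\right)^{\tau_1} \exp\left(c\,\tau_2\,\phi(\theta)\right) d\theta < \infty.$$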


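Example 4.18. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from Bernoulli$(\theta)$. Since

$$\ell(\theta \mid x) = (1 - \theta)^{n} \exp\left\{\sum_{i=1}^{n} x_i \log\left(\frac{\theta}{1 - \theta}\right)\right\},$$

the conjugate prior for $\theta$ is

$$g(\theta \mid \tau_1, \tau_2) = \left(k(\tau_1, \tau_2)\right)^{-1} (1 - \theta)^{\tau_1} \exp\left\{\tau_2 \log\left(\frac{\theta}{1 - \theta}\right)\right\},$$

where

$$k(\tau_1, \tau_2) = \int_0^1 (1 - \theta)^{\tau_1} \exp\left\{\tau_2 \log\left(\frac{\theta}{1 - \theta}\right)\right\} d\theta = \int_0^1 \theta^{\tau_2} (1 - \theta)^{\tau_1 - \tau_2}\, d\theta = B(\tau_2 + 1, \tau_1 - \tau_2 + 1).$$

Hence, the natural conjugate prior for $\theta$ is the beta distribution with hyperparameters $\tau_2 + 1$, $\tau_1 - \tau_2 + 1$.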


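Example 4.19. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from $N(\theta, 1)$. We have

$$\ell(\theta \mid x) = \left(\frac{1}{\sqrt{2\pi}}\right)^{n} \exp\left[-\frac{1}{2}\left(\sum x_i^2 - 2n\theta\bar{x} + n\theta^2\right)\right] = \left(\frac{1}{\sqrt{2\pi}}\right)^{n} e^{-\sum x_i^2/2}\, e^{-n\theta^2/2} \exp(n\bar{x}\theta).$$

Here $u(x) = \left(\dfrac{1}{\sqrt{2\pi}}\right)^{n} e^{-\sum x_i^2/2}$, $v(\theta) = e^{-n\theta^2/2}$, $c = n$, $\phi(\theta) = \theta$ and $h(x) = \bar{x}$. Hence the conjugate prior for $\theta$ is

$$g(\theta \mid \tau_1, \tau_2) = \left(k(\tau_1, \tau_2)\right)^{-1} e^{-\tau_1\theta^2/2} \exp(\tau_2\theta),$$

where

$$k(\tau_1, \tau_2) = \int_{-\infty}^{\infty} \exp\left(-\tau_1\theta^2/2 + \tau_2\theta\right) d\theta = e^{\tau_2^2/2\tau_1} \sqrt{\frac{2\pi}{\tau_1}} < \infty, \qquad \text{provided } \tau_1 > 0.$$

Hence the natural conjugate prior for $\theta$ is the normal distribution with mean $\tau_2/\tau_1$ and variance $1/\tau_1$.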


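Result 4.15. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from a one-parameter exponential family $f(x \mid \theta)$ and $g(\theta \mid \tau_1, \tau_2)$ is the corresponding conjugate prior density of $\theta$; then the posterior density of $\theta$, given $x$, $\tau$, is

$$g(\theta \mid x, \tau) = g\left(\theta \,\Big|\, \tau_1 + n,\; \tau_2 + \sum_{i=1}^{n} h(x_i)\right).$$

Proof. Using Bayes theorem, the posterior density of $\theta$, given $x$, $\tau$, is

$$g(\theta \mid \tau, x) \propto \ell(\theta \mid x)\, g(\theta \mid \tau); \qquad x = (x_1, \ldots, x_n),\; \tau = (\tau_1, \tau_2)$$

$$\propto \left(v(\theta)\right)^{\tau_1 + n} \exp\left[c\,\phi(\theta)\left\{\tau_2 + \sum_{i=1}^{n} h(x_i)\right\}\right] \propto g\left(\theta \,\Big|\, \tau_1 + n,\; \tau_2 + \sum_{i=1}^{n} h(x_i)\right).$$

Remark 4.35. It may be observed that $\sum_{i=1}^{n} h(x_i)$ is a sufficient statistic for $\theta$.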


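Example 4.20. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from Pois$(\theta)$. Then the likelihood function of $\theta$, given $x$, is

$$\ell(\theta \mid x) \propto e^{-n\theta} \exp\left(\sum x_i \log\theta\right).$$

The natural conjugate prior density for $\theta$ is

$$g(\theta \mid \tau_1, \tau_2) \propto e^{-\tau_1\theta} \exp(\tau_2 \log\theta).$$

Hence, the posterior density of $\theta$, given $\tau_1$, $\tau_2$ and $x$, is

$$g(\theta \mid \tau_1, \tau_2, x) \propto g\left(\theta \mid \tau_1 + n, \tau_2 + \sum x_i\right) \propto e^{-(n + \tau_1)\theta} \exp\left(\left(\sum x_i + \tau_2\right)\log\theta\right) \propto \theta^{\sum x_i + \tau_2} \exp\left(-(n + \tau_1)\theta\right),$$

which is a gamma density with parameters $\sum x_i + \tau_2 + 1$ and $n + \tau_1$.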
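
The rule of Result 4.15, $(\tau_1, \tau_2) \to (\tau_1 + n, \tau_2 + \sum h(x_i))$, can be sketched generically and then specialised to the Poisson case of Example 4.20. The function names and numbers are hypothetical; the mapping to a gamma shape and rate follows the last line of the example.

```python
def expfam_update(tau1, tau2, xs, h=lambda x: x):
    """Conjugate update for a one-parameter exponential family (Result 4.15)."""
    return tau1 + len(xs), tau2 + sum(h(x) for x in xs)

def poisson_gamma_params(tau1, tau2):
    """Gamma (shape, rate) implied by exp(-tau1*theta) * theta**tau2 (Example 4.20)."""
    return tau2 + 1, tau1

# Hypothetical prior hyperparameters and Poisson counts.
xs = [3, 0, 2, 4]
t1, t2 = expfam_update(2.0, 0.5, xs)       # h(x) = x for the Poisson family
shape, rate = poisson_gamma_params(t1, t2)
```

The prior here corresponds to Gamma(1.5, 2.0), so the result agrees with the direct Gamma$(\alpha + \sum x_i, \beta + n)$ update of Result 4.9.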


Remark 4.36. The results for one-parameter exponential family can be easily extended to the k- 
parameter exponential family case. 
Result 4.16. Suppose that $X$ follows the one-parameter exponential family of distributions

$$f(x \mid \theta) = \exp\left(A(\theta)B(x) + C(x) + D(\theta)\right)$$

and that the prior distribution

$$g(\theta) = \sum_{i=1}^{k} p_i \exp\left(a_i A(\theta) + d_i D(\theta) + c(a_i, d_i)\right),$$

where $p_i > 0$ and $\sum p_i = 1$, is a mixture of natural conjugates. Then, the posterior distribution is the mixture with components

$$g_i(\theta \mid x) = \exp\left(a_i^* A(\theta) + d_i^* D(\theta) + c(a_i^*, d_i^*)\right),$$

where $a_i^* = a_i + B(x)$ and $d_i^* = d_i + 1$, and the weights are proportional to $p_i \exp\left(c(a_i, d_i) - c(a_i^*, d_i^*)\right)$, $i = 1, 2, \ldots, k$.

Remark 4.37. Mixtures of natural conjugate priors offer a very diverse family of distributions that is 
capable of representing much more varied prior beliefs than a single natural conjugate (for illustration 
refer to Example 4.17). 

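Result 4.17. For a one-parameter exponential family, let us define $\psi = c\,\phi(\theta)$ and $y = h(x)$. We have

$$f(y \mid \psi) = a(y) \exp\left(y\psi - b(\psi)\right), \qquad y \in Y.$$

This transformed density is said to be in canonical form. The corresponding natural conjugate prior for $\psi$ with hyperparameters $(n_0, y_0)$ is

$$g(\psi \mid n_0, y_0) = c(n_0, y_0) \exp\left(n_0 y_0 \psi - n_0 b(\psi)\right); \qquad \psi \in \Psi. \tag{4.12}$$

(i) The posterior density of $\psi$ is

$$g(\psi \mid n_0, y_0, y_1, \ldots, y_n) = g\left(\psi \,\Big|\, n_0 + n,\; \frac{n_0 y_0 + n\bar{y}_n}{n + n_0}\right), \qquad \bar{y}_n = \frac{1}{n}\sum_{i=1}^{n} y_i. \tag{4.13}$$

(ii)

$$E\left[\frac{\partial b(\psi)}{\partial\psi} \,\Big|\, y_1, \ldots, y_n\right] = \frac{n}{n_0 + n}\,\bar{y}_n + \frac{n_0}{n_0 + n}\, y_0. \tag{4.14}$$

Proof.
(i) $$g(\psi \mid n_0, y_0, y_1, \ldots, y_n) \propto \exp\left(\psi\sum y_i - n b(\psi)\right) \exp\left(n_0 y_0 \psi - n_0 b(\psi)\right) = \exp\left[(n + n_0)\left\{\psi\,\frac{n_0 y_0 + n\bar{y}_n}{n + n_0} - b(\psi)\right\}\right].$$

(ii) Since $\int_{\Psi} g(\psi \mid n_0, y_0, y_1, \ldots, y_n)\, d\psi = 1$, differentiating the integrand with respect to $\psi$, and assuming that the posterior density vanishes on the boundary of $\Psi$, we get

$$c\left(n + n_0, \frac{n_0 y_0 + n\bar{y}_n}{n + n_0}\right) \int_{\Psi} \exp\left[(n_0 y_0 + n\bar{y}_n)\psi - (n + n_0) b(\psi)\right] \left\{(n_0 y_0 + n\bar{y}_n) - (n + n_0)\frac{\partial b(\psi)}{\partial\psi}\right\} d\psi = 0.$$

Rearranging the terms, we get the required result.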


Remark 4.38. Note that $E\left[\dfrac{\partial b(\psi)}{\partial\psi} \,\Big|\, n_0, y_0\right] = y_0$. The weighted average form (4.14) of the posterior mean suggests that the prior parameter, $n_0$, attached to the prior mean, $y_0$, plays an analogous role to the sample size $n$ attached to the sample mean $\bar{y}_n$.

Example 4.21. For the Bernoulli model in Example (2.1), the canonical form is obtained by putting $y = x$, $\psi = \log\left(\dfrac{\theta}{1 - \theta}\right)$, $a(y) = 1$, $b(\psi) = \log(1 + e^{\psi})$ and $[c(n_0, y_0)]^{-1} = \text{Beta}(n_0 y_0 + 1, n_0 - n_0 y_0 + 1)$.

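Definition 4.7. A pdf (or pmf) with vector parameter $\theta = (\theta_1, \theta_2)$ is said to be a member of the two-parameter exponential family if

$$f(x \mid \theta) = u(x)\, v(\theta_1, \theta_2) \exp\left\{\sum_{i=1}^{2} c_i\, \phi_i(\theta_1, \theta_2)\, h_i(x)\right\}.$$

The conjugate family for $\theta = (\theta_1, \theta_2)$ with hyperparameter $\tau = (\tau_0, \tau_1, \tau_2)$ is

$$g(\theta \mid \tau) = \left(K(\tau)\right)^{-1} \left(v(\theta)\right)^{\tau_0} \exp\left\{\sum_{i=1}^{2} c_i\, \phi_i(\theta)\, \tau_i\right\}, \qquad \theta \in \Theta \subset R^2,$$

where $\tau$ is such that

$$K(\tau) = \int_{\Theta} \left(v(\theta)\right)^{\tau_0} \exp\left\{\sum_{i=1}^{2} c_i\, \phi_i(\theta)\, \tau_i\right\} d\theta < \infty.$$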


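Example 4.22. Let $X \sim N(\mu, \tau)$, where both $\mu$ and the precision $\tau$ are unknown. Then

$$f(x \mid \theta) = \left(\frac{\tau}{2\pi}\right)^{1/2} \exp\left\{-\frac{\tau}{2}(x - \mu)^2\right\}, \qquad \theta = (\mu, \tau),$$

$$= \sqrt{\frac{\tau}{2\pi}} \exp\left(-\frac{\tau}{2}x^2 + \tau\mu x - \frac{\tau\mu^2}{2}\right).$$

Here

$$u(x) = 1/\sqrt{2\pi}, \quad v(\theta) = \sqrt{\tau}\exp\left(-\frac{\tau\mu^2}{2}\right), \quad h_1(x) = x, \quad h_2(x) = x^2,$$

$$\phi_1(\theta) = \tau\mu, \quad \phi_2(\theta) = \tau, \quad c_1 = 1, \quad c_2 = -1/2.$$

Hence it is a member of the two-parameter exponential family of distributions. The conjugate prior density for $\theta = (\mu, \tau)$, with hyperparameters $(\alpha_0, \alpha_1, \alpha_2)$, is

$$g(\mu, \tau \mid \alpha_0, \alpha_1, \alpha_2) \propto \left[\sqrt{\tau}\exp\left(-\frac{\tau\mu^2}{2}\right)\right]^{\alpha_0} \exp(\tau\mu\alpha_1) \exp\left(-\frac{1}{2}\tau\alpha_2\right)$$

$$\propto \tau^{u - 1}\exp(-\tau v)\, \sqrt{\tau(2u - 1)}\, \exp\left\{-\frac{\tau(2u - 1)}{2}(\mu - w)^2\right\}, \qquad u > \frac{1}{2},\; v > 0,\; w \in (-\infty, \infty),$$

where $u = \dfrac{\alpha_0 + 1}{2}$, $v = \dfrac{1}{2}\left(\alpha_2 - \dfrac{\alpha_1^2}{\alpha_0}\right)$, $w = \dfrac{\alpha_1}{\alpha_0}$.

Note that the joint conjugate prior density of $(\mu, \tau)$ is expressed as a product of a conditional normal prior for $\mu$, given $\tau$, and a marginal gamma prior for $\tau$.

Remark 4.39. In a two-parameter exponential family, let us denote $y_i = h_i(x)$, $\psi_i = c_i\phi_i(\theta)$, $i = 1, 2$; then

$$f(y \mid \psi) = a(y)\exp\left(y_1\psi_1 + y_2\psi_2 - b(\psi)\right); \qquad \psi = (\psi_1, \psi_2).$$

This transformed density is in canonical form.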


4.7 MULTIVARIATE NORMAL DISTRIBUTION 


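Result 4.18. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $k$-variate normal distribution with unknown mean $\theta$ and known precision matrix $r$, where $r$ is a non-singular $k \times k$ positive definite symmetric matrix. If the vector parameter $\theta$ is distributed like MVN$(\mu, \tau)$ then the posterior distribution of $\theta$, given $x$, is MVN$(\mu^*, \tau + nr)$, where $\mu^* = (\tau + nr)^{-1}(\tau\mu + nr\bar{x})$, $\bar{x} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k)'$, $\bar{x}_i = \dfrac{1}{n}\sum_{j=1}^{n} x_{ij}$, and $x_{ij}$ is the $j$th observation on the $X_i$ variable, $i = 1, 2, \ldots, k$.

Proof. The likelihood function of the mean vector $\theta$ is

$$\ell(\theta \mid x) = \frac{|r|^{n/2}}{(2\pi)^{nk/2}} \exp\left[-\frac{1}{2}\sum_{i=1}^{n}(x_i - \theta)' r (x_i - \theta)\right] \propto \exp\left[-\frac{1}{2}(\theta - \bar{x})'\, nr\, (\theta - \bar{x})\right],$$

since

$$\sum_{i=1}^{n}(x_i - \theta)' r (x_i - \theta) = \sum_{i=1}^{n}(x_i - \bar{x})' r (x_i - \bar{x}) + n(\bar{x} - \theta)' r (\bar{x} - \theta).$$

The posterior distribution of $\theta$, given $x$, is

$$g(\theta \mid x) \propto \exp\left[-\frac{1}{2}\left\{(\theta - \bar{x})'\, nr\, (\theta - \bar{x}) + (\theta - \mu)' \tau (\theta - \mu)\right\}\right] \propto \exp\left[-\frac{1}{2}(\theta - \mu^*)'(\tau + nr)(\theta - \mu^*)\right],$$

since

$$(\theta - \bar{x})'\, nr\, (\theta - \bar{x}) + (\theta - \mu)' \tau (\theta - \mu) = (\theta - \mu^*)'(\tau + nr)(\theta - \mu^*) + (\bar{x} - \mu)' \tau (\tau + nr)^{-1} nr\, (\bar{x} - \mu),$$

where

$$\mu^* = (\tau + nr)^{-1}(\tau\mu + nr\bar{x}),$$

follows from the algebraic identity

$$(\theta - \mu)' A (\theta - \mu) + (\theta - \theta_0)' B (\theta - \theta_0) = (\theta - \theta_1)'(A + B)(\theta - \theta_1) + (\mu - \theta_0)' A (A + B)^{-1} B (\mu - \theta_0), \tag{4.15}$$

where $\theta_1 = (A + B)^{-1}(A\mu + B\theta_0)$; $\theta$, $\mu$, $\theta_0$ are $k \times 1$ vectors; $A$ and $B$ are $k \times k$ symmetric matrices, such that $(A + B)^{-1}$ exists.
(for proof see Box and Tiao (1973), page 418)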
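
Result 4.18 and identity (4.15) can be checked numerically with NumPy. This is a sketch under stated assumptions: the function name `mvn_posterior`, the matrices, the prior mean, and the simulated data are all hypothetical, and the identity is evaluated with $A = \tau$, $B = nr$ and $\theta_0 = \bar{x}$ at one arbitrary point.

```python
import numpy as np

def mvn_posterior(x, r, mu, tau):
    """Posterior mean and precision of theta for known sampling precision r.

    x: (n, k) observations; r: (k, k) precision matrix;
    mu: (k,) prior mean; tau: (k, k) prior precision matrix.
    """
    n = x.shape[0]
    xbar = x.mean(axis=0)
    post_prec = tau + n * r
    mu_star = np.linalg.solve(post_prec, tau @ mu + n * (r @ xbar))
    return mu_star, post_prec

# Hypothetical inputs for illustration.
rng = np.random.default_rng(0)
r = np.array([[2.0, 0.5], [0.5, 1.0]])
tau = np.array([[1.0, 0.2], [0.2, 3.0]])
mu = np.array([0.0, 1.0])
x = rng.multivariate_normal(mean=[1.0, -1.0], cov=np.linalg.inv(r), size=25)
mu_star, post_prec = mvn_posterior(x, r, mu, tau)

# Both sides of identity (4.15) at an arbitrary test point theta.
n, xbar = x.shape[0], x.mean(axis=0)
theta = np.array([0.3, -0.7])
lhs = (theta - mu) @ tau @ (theta - mu) + (theta - xbar) @ (n * r) @ (theta - xbar)
rhs = ((theta - mu_star) @ post_prec @ (theta - mu_star)
       + (mu - xbar) @ tau @ np.linalg.solve(post_prec, (n * r) @ (mu - xbar)))
```

Using `np.linalg.solve` avoids forming $(\tau + nr)^{-1}$ explicitly, which is the usual numerical preference over a matrix inverse.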


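Result 4.19. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a $k$-variate MVN$(\theta, r)$, with known mean vector $\theta$ but unknown precision matrix $r$. If the prior distribution for $r$ is Wishart$(\alpha, \tau)$ such that $\alpha > k - 1$ and $\tau$ is a symmetric positive definite matrix, then the posterior distribution of $r$ is also distributed like Wishart$(\alpha^*, \tau^*)$, where $\alpha^* = \alpha + n$ and $\tau^* = \tau + \sum_{i=1}^{n}(x_i - \theta)(x_i - \theta)'$.

Proof. The likelihood function of the precision matrix $r$, given $x$, is

$$\ell(r \mid x) \propto |r|^{n/2} \exp\left[-\frac{1}{2}\sum_{i=1}^{n}(x_i - \theta)' r (x_i - \theta)\right]$$

and the prior distribution of $r$ is

$$g(r \mid \alpha, \tau) \propto |r|^{(\alpha - k - 1)/2} \exp\left[-\frac{1}{2}\mathrm{tr}(\tau r)\right].$$

Therefore, the posterior distribution of $r$, given $x$, is

$$g(r \mid x, \alpha, \tau) \propto \ell(r \mid x)\, g(r \mid \alpha, \tau) \propto |r|^{(n + \alpha - k - 1)/2} \exp\left[-\frac{1}{2}\left\{\sum_{i=1}^{n}(x_i - \theta)' r (x_i - \theta) + \mathrm{tr}(\tau r)\right\}\right].$$

Since $\sum_{i=1}^{n}(x_i - \theta)' r (x_i - \theta)$ is a scalar quantity, we have

$$\sum_{i=1}^{n}(x_i - \theta)' r (x_i - \theta) = \mathrm{tr}\left[\sum_{i=1}^{n}(x_i - \theta)' r (x_i - \theta)\right] = \mathrm{tr}\left[\sum_{i=1}^{n}(x_i - \theta)(x_i - \theta)'\, r\right].$$

Thus

$$g(r \mid x, \alpha, \tau) \propto |r|^{(n + \alpha - k - 1)/2} \exp\left[-\frac{1}{2}\mathrm{tr}\left\{\left(\sum_{i=1}^{n}(x_i - \theta)(x_i - \theta)' + \tau\right) r\right\}\right],$$

which is a kernel of the Wishart distribution with parameters $\alpha + n$ and $\tau^*$.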


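Result 4.20. Let $X_1, X_2, \ldots, X_n$ be a random sample from a MVN$(\theta, \omega r)$, where $r$ is a $k \times k$ specified symmetric positive definite matrix but the mean vector $\theta$ and the scalar $\omega$ are unknown. Suppose that the joint prior distribution of $\theta$ and $\omega$ is such that the conditional distribution of $\theta$, given $\omega$, is multivariate normal with mean vector $\mu$ and precision $\omega\tau$, where $\tau$ is a known $k \times k$ symmetric positive definite matrix, and the marginal prior distribution of $\omega$ is Gamma$(\alpha, \beta)$ such that $\alpha, \beta > 0$. Then the conditional posterior distribution of $\theta$, given $\omega$, is MVN$(\mu^*, \omega(\tau + nr))$ and the marginal posterior distribution of $\omega$ is Gamma$(\alpha^*, \beta^*)$, where

$$\mu^* = (\tau + nr)^{-1}(\tau\mu + nr\bar{x}), \qquad \alpha^* = \alpha + nk/2,$$

and

$$\beta^* = \beta + \sum_{i=1}^{n}(x_i - \bar{x})' r (x_i - \bar{x})/2 + (\mu^* - \mu)' \tau (\bar{x} - \mu)/2.$$

Proof. The likelihood function of $\theta$ and $\omega$, given $x$, is

$$\ell(\theta, \omega \mid x) \propto |\omega r|^{n/2} \exp\left[-\sum_{i=1}^{n}(x_i - \theta)'\, \omega r\, (x_i - \theta)/2\right],$$

and the joint prior distribution of $\theta$ and $\omega$ is

$$g(\theta, \omega) = g(\theta \mid \omega)\, g(\omega) \propto \omega^{k/2 + \alpha - 1} \exp\left[-\beta\omega - \frac{1}{2}(\theta - \mu)'\, \omega\tau\, (\theta - \mu)\right].$$

Since $\tau$ is a $k \times k$ matrix and $\omega$ is a scalar quantity, $|\omega\tau| = \omega^k |\tau|$, the posterior distribution of $\theta$ and $\omega$ is

$$g(\theta, \omega \mid x) \propto \omega^{nk/2 + k/2 + \alpha - 1} \exp\left[-\omega\left\{\beta + \frac{1}{2}(\theta - \mu)' \tau (\theta - \mu) + \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar{x})' r (x_i - \bar{x}) + \frac{1}{2}(\theta - \bar{x})'\, nr\, (\theta - \bar{x})\right\}\right]$$

$$\propto \omega^{\alpha + nk/2 - 1} \exp\left[-\omega\beta^*\right]\, \omega^{k/2} \exp\left[-\frac{1}{2}(\theta - \mu^*)'\, \omega(\tau + nr)\, (\theta - \mu^*)\right],$$

since

$$(\mu - \bar{x})' \tau (\tau + nr)^{-1} nr\, (\mu - \bar{x}) = \left[(\tau + nr)^{-1} nr\, (\bar{x} - \mu)\right]' \tau (\bar{x} - \mu)$$

$$= \left[(\tau + nr)^{-1}\left((nr\bar{x} + \tau\mu) - (\tau + nr)\mu\right)\right]' \tau (\bar{x} - \mu)$$

$$= (\mu^* - \mu)' \tau (\bar{x} - \mu).$$

Note that $g(\theta, \omega \mid x)$ is a product of the kernels of Gamma$(\alpha + nk/2, \beta^*)$ and MVN$(\mu^*, \omega(\tau + nr))$.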
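Result 4.21. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a MVN$(\theta, r)$, with both $\theta$ and $r$ unknown. Let the joint prior distribution of $\theta$ and $r$ be such that the conditional prior density of $\theta$, given $r$, is MVN$(\mu, \nu r)$, $\nu > 0$, and the marginal prior density of $r$ is Wishart$(\alpha, \tau)$, where $\alpha > k - 1$ and $\tau$ is a $k \times k$ symmetric positive definite matrix. Then the conditional posterior distribution of $\theta$, given $r$, is MVN$(\mu^*, (n + \nu) r)$ and the marginal posterior density of $r$ is Wishart$(\alpha + n, \tau^*)$, where

$$\mu^* = (n + \nu)^{-1}(n\bar{x} + \nu\mu)$$

and

$$\tau^* = \tau + \sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})' + \frac{\nu n}{\nu + n}(\mu - \bar{x})(\mu - \bar{x})'.$$

Further, the marginal posterior distribution of $\theta$ is a multivariate t-density with $(\alpha + n - k + 1)$ df, location parameter $\mu^*$, and scale parameter $(\nu + n)(\alpha + n - k + 1)\tau^{*-1}$.

Proof. The likelihood function of $\theta$ and $r$ is

$$\ell(\theta, r \mid x) \propto |r|^{n/2} \exp\left[-\frac{1}{2}\sum_{i=1}^{n}(x_i - \theta)' r (x_i - \theta)\right]$$

and the joint prior density of $\theta$ and $r$ is

$$g(\theta, r) = g(\theta \mid r)\, g(r) \propto |r|^{1/2} \exp\left[-\frac{1}{2}(\theta - \mu)'\, \nu r\, (\theta - \mu)\right] |r|^{(\alpha - k - 1)/2} \exp\left[-\frac{1}{2}\mathrm{tr}(\tau r)\right].$$

Since

$$\sum_{i=1}^{n}(x_i - \theta)' r (x_i - \theta) = (\theta - \bar{x})'\, nr\, (\theta - \bar{x}) + \sum_{i=1}^{n}(x_i - \bar{x})' r (x_i - \bar{x}) = (\theta - \bar{x})'\, nr\, (\theta - \bar{x}) + \mathrm{tr}(sr), \qquad s = \sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})',$$

and

$$(\theta - \bar{x})'\, nr\, (\theta - \bar{x}) + (\theta - \mu)'\, \nu r\, (\theta - \mu) = (\theta - \mu^*)'(nr + \nu r)(\theta - \mu^*) + (\bar{x} - \mu)'\, nr (nr + \nu r)^{-1} \nu r\, (\bar{x} - \mu),$$

where

$$\mu^* = (nr + \nu r)^{-1}(nr\bar{x} + \nu r\mu) = (n + \nu)^{-1}(n\bar{x} + \nu\mu), \quad \text{and} \quad nr(nr + \nu r)^{-1}\nu r = n\nu r/(n + \nu),$$

the posterior distribution of $\theta$ and $r$, given $x$, is

$$g(\theta, r \mid x) \propto |r|^{(n + \alpha - k)/2} \exp\left[-\frac{1}{2}\left\{\sum_{i=1}^{n}(x_i - \theta)' r (x_i - \theta) + (\theta - \mu)'\, \nu r\, (\theta - \mu) + \mathrm{tr}(\tau r)\right\}\right]$$

$$= |r|^{(n + \alpha - k)/2} \exp\left[-\frac{1}{2}\left\{(\theta - \bar{x})'\, nr\, (\theta - \bar{x}) + \mathrm{tr}(sr) + (\theta - \mu)'\, \nu r\, (\theta - \mu) + \mathrm{tr}(\tau r)\right\}\right]$$

$$= |r|^{(n + \alpha - k)/2} \exp\left[-\frac{1}{2}\left\{\mathrm{tr}((s + \tau) r) + (\theta - \mu^*)'(n + \nu) r (\theta - \mu^*) + \frac{n\nu}{n + \nu}(\bar{x} - \mu)' r (\bar{x} - \mu)\right\}\right]$$

$$= |r|^{1/2} \exp\left[-\frac{1}{2}(\theta - \mu^*)'(n + \nu) r (\theta - \mu^*)\right] |r|^{(n + \alpha - k - 1)/2} \exp\left[-\frac{1}{2}\mathrm{tr}(\tau^* r)\right],$$

which is a product of the kernels of the conditional posterior density of $\theta$, given $r$, and the marginal posterior density of $r$.

In order to work out the marginal posterior density of $\theta$, we obtain the marginal prior density of $\theta$ and transform the prior hyperparameters $\mu$, $\alpha$ and $\tau$ into the corresponding posterior parameters $\mu^*$, $\alpha + n$, $\tau^*$. The marginal prior distribution of $\theta$ is

$$g(\theta) = \int g(\theta, r)\, dr \propto \int |r|^{(\alpha - k)/2} \exp\left[-\frac{1}{2}\left\{\mathrm{tr}(\tau r) + (\theta - \mu)'\, \nu r\, (\theta - \mu)\right\}\right] dr = \int |r|^{((\alpha + 1) - k - 1)/2} \exp\left[-\frac{1}{2}\mathrm{tr}(\tau_1 r)\right] dr,$$

where

$$\tau_1 = \tau + \nu(\theta - \mu)(\theta - \mu)'.$$

On using the identity

$$\begin{vmatrix} A & C \\ B & D \end{vmatrix} = |A|\,|D - BA^{-1}C| \quad \text{for } B = (\theta - \mu)',\; C = -(\theta - \mu),\; D = 1 \text{ and } A = \tau/\nu,$$

and the fact that the integrand is a kernel of a Wishart distribution, we have

$$g(\theta) \propto \left|\tau + \nu(\theta - \mu)(\theta - \mu)'\right|^{-(\alpha + 1)/2} \propto |\tau|^{-(\alpha + 1)/2}\left[1 + (\theta - \mu)'\, \nu\tau^{-1} (\theta - \mu)\right]^{-(\alpha + 1)/2}$$

$$\propto \left[1 + \frac{(\theta - \mu)'\, \nu(\alpha + 1 - k)\tau^{-1}\, (\theta - \mu)}{\alpha + 1 - k}\right]^{-(\alpha + 1 - k + k)/2},$$

which is a multivariate t-density with $(\alpha + 1 - k)$ df, location parameter $\mu$, and scale parameter $\nu(\alpha + 1 - k)\tau^{-1}$. Thus the marginal posterior density of $\theta$ is a multivariate t-density with $(\alpha + n + 1 - k)$ df, location parameter $\mu^*$, and scale parameter $(\nu + n)(\alpha + n + 1 - k)\tau^{*-1}$.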
Remark 4.40. The joint distribution of mean vector $\theta$ and precision matrix $r$ is said to be a normal-Wishart distribution if the conditional distribution of $\theta$, given $r$, is a MVN$(\mu, \nu r)$ and the marginal distribution of the precision matrix $r$ is Wishart$(\alpha, \tau)$. If we let $\nu \to 0$, $\alpha \to -1$ and $\tau$ tends to a zero matrix, then the conditional posterior distribution of $\theta$, given $r$, tends to MVN$(\bar{x}, nr)$ and the marginal posterior distribution of $r$ tends to Wishart$(n - 1, s)$. This limiting posterior distribution is obtained when the hyperparameter $\alpha$ violates the condition $\alpha > k - 1$, and $\alpha$ must approach the value $-1$. This limiting posterior distribution is obtained for the improper Jeffreys' prior $g(\theta, r) \propto |r|^{-(k + 1)/2}$, which is meaningful when we take $\theta$ and $r$ a-priori independent such that $g(\theta)$ is uniform over $k$-dimensional Euclidean space and $g(r) \propto |r|^{-(k + 1)/2}$.

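Example 4.23 (O’Hagan and Forster, 2004) Suppose $X_1, X_2, \ldots, X_{n_1}$ and $Y_1, Y_2, \ldots, Y_{n_2}$ are two independent random samples from normal distributions with means $\mu$ and $\mu + \delta$, respectively, and having a common known variance $v$. Let $\mu$ and $\delta$ be, a-priori, independent with prior distributions $N(m, w)$ and $N(d, g)$, respectively. Then the posterior distribution of $\delta$ is $N(d_1, g_1)$, where

$$d_1 = \left(c_1 g(\bar{y} - \bar{x}) + c_2 g(\bar{y} - m) + c_3 v d\right)/\left((c_1 + c_2) g + c_3 v\right),$$

$$g_1 = c_3 v g/\left((c_1 + c_2) g + c_3 v\right),$$

and $c_1 = n_1 n_2 w$, $c_2 = n_2 v$, $c_3 = (n_1 + n_2) w + v$.

Proof. The likelihood function of $\mu$ and $\mu + \delta$ is

$$\ell(\mu, \mu + \delta \mid x, y) \propto \exp\left[-\frac{1}{2}\begin{pmatrix} \mu - \bar{x} \\ \mu + \delta - \bar{y} \end{pmatrix}' \begin{pmatrix} v/n_1 & 0 \\ 0 & v/n_2 \end{pmatrix}^{-1} \begin{pmatrix} \mu - \bar{x} \\ \mu + \delta - \bar{y} \end{pmatrix}\right]$$

and the joint prior distribution of $\mu$ and $\delta$ is a bivariate normal distribution with mean vector $(m, d)'$ and covariance matrix $\begin{pmatrix} w & 0 \\ 0 & g \end{pmatrix}$. Hence, the joint prior density of $\mu$ and $\mu + \delta$ is a bivariate normal distribution with mean vector $(m, m + d)'$ and covariance matrix $\begin{pmatrix} w & w \\ w & w + g \end{pmatrix}$. Therefore the joint posterior distribution of $\mu$ and $\mu + \delta$ is a bivariate normal with mean

$$\frac{1}{g(c_1 + c_2) + c_3 v}\begin{pmatrix} (n_2 g + v) v w & v^2 w \\ v^2 w & v(gv + wv + n_1 w g) \end{pmatrix}\begin{pmatrix} \dfrac{m}{w} - \dfrac{d}{g} + \dfrac{n_1\bar{x}}{v} \\[2mm] \dfrac{d}{g} + \dfrac{n_2\bar{y}}{v} \end{pmatrix}$$

and the covariance matrix

$$\frac{v^2 g w}{g(c_1 + c_2) + c_3 v}\begin{pmatrix} (n_2 g + v)/vg & 1/g \\ 1/g & (gv + wv + n_1 w g)/vwg \end{pmatrix},$$

since

$$\begin{pmatrix} w & w \\ w & w + g \end{pmatrix}^{-1} = \frac{1}{wg}\begin{pmatrix} w + g & -w \\ -w & w \end{pmatrix}.$$

Hence, the joint posterior distribution of $\mu$ and $\delta$ is also a bivariate normal distribution with the mean vector

$$\begin{pmatrix} m_1 \\ d_1 \end{pmatrix} = \frac{1}{g(c_1 + c_2) + c_3 v}\begin{pmatrix} vw + n_2 g w & vw \\ -n_2 g w & gv + n_1 w g \end{pmatrix}\frac{1}{wg}\begin{pmatrix} mgv - dvw + n_1\bar{x} w g \\ dvw + n_2\bar{y} w g \end{pmatrix}$$

and the covariance matrix

$$\frac{1}{g(c_1 + c_2) + c_3 v}\begin{pmatrix} vw(n_2 g + v) & -n_2 v g w \\ -n_2 v g w & vg(n_1 w + n_2 w + v) \end{pmatrix}.$$

Thus, the marginal posterior distribution of $\delta$ is a normal with mean

$$d_1 = \frac{1}{\left(g(c_1 + c_2) + c_3 v\right) w g}\left[-n_2 g w\,(mgv - dvw + n_1\bar{x} w g) + (gv + n_1 w g)(dvw + n_2\bar{y} w g)\right]$$

$$= \frac{1}{g(c_1 + c_2) + c_3 v}\left(c_1 g(\bar{y} - \bar{x}) + c_2 g(\bar{y} - m) + d v c_3\right),$$

and

$$\text{variance} = \frac{1}{g(c_1 + c_2) + c_3 v}\left(n_1 v g w + n_2 v g w + v^2 g\right) = \frac{c_3 v g}{g(c_1 + c_2) + c_3 v}.$$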
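
The closed forms for $d_1$ and $g_1$ in Example 4.23 can be compared with a direct computation from the posterior precision matrix of $(\mu, \delta)$. This is a sketch only: the function names and all numerical inputs are hypothetical, and the direct route works with $(\mu, \delta)$ rather than $(\mu, \mu + \delta)$ as in the proof.

```python
import numpy as np

def delta_posterior_closed_form(xbar, ybar, n1, n2, v, m, w, d, g):
    """d1 and g1 from the formulas stated in Example 4.23."""
    c1, c2, c3 = n1 * n2 * w, n2 * v, (n1 + n2) * w + v
    denom = (c1 + c2) * g + c3 * v
    d1 = (c1 * g * (ybar - xbar) + c2 * g * (ybar - m) + c3 * v * d) / denom
    g1 = c3 * v * g / denom
    return d1, g1

def delta_posterior_direct(xbar, ybar, n1, n2, v, m, w, d, g):
    """Same quantities from the joint normal posterior of (mu, delta)."""
    # Posterior precision = prior precision + information from xbar and ybar.
    P = np.array([[1 / w + (n1 + n2) / v, n2 / v],
                  [n2 / v, 1 / g + n2 / v]])
    b = np.array([m / w + (n1 * xbar + n2 * ybar) / v,
                  d / g + n2 * ybar / v])
    cov = np.linalg.inv(P)
    mean = cov @ b
    return mean[1], cov[1, 1]

# Hypothetical inputs for illustration.
args = dict(xbar=4.2, ybar=5.1, n1=12, n2=15, v=2.0, m=4.0, w=1.5, d=0.0, g=0.8)
closed = delta_posterior_closed_form(**args)
direct = delta_posterior_direct(**args)
```

Both routes should agree up to floating-point error, which gives an independent check on the constants $c_1$, $c_2$ and $c_3$.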


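Example 4.24. (DeGroot, 1971) Let $X \sim$ BVN$(\theta, r)$ with $r = \begin{pmatrix} 1/3 & -1/3 \\ -1/3 & 4/3 \end{pmatrix}$ and $\theta \sim$ BVN$(\mu, \tau)$ with precision matrix $\tau = \begin{pmatrix} 1 & -1 \\ -1 & 6 \end{pmatrix}$. How large a sample must be taken in order that the variance of the posterior distribution of $(\theta_1 - \theta_2)$ will be reduced to 0.01?

Solution. The posterior distribution of $\theta$, given $x$, is $N(\mu^*, \tau + nr)$, where $\mu^* = (\tau + nr)^{-1}(\tau\mu + nr\bar{x}) = (\mu_1^*, \mu_2^*)'$. To find the posterior distribution of $\theta_1 - \theta_2$, we note

$$E(\theta_1 - \theta_2 \mid x) = \mu_1^* - \mu_2^*$$

and

$$\mathrm{Var}(\theta_1 - \theta_2 \mid x) = \mathrm{Var}\left[(1, -1)\begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} \Big|\; x\right] = (1, -1)(\tau + nr)^{-1}\begin{pmatrix} 1 \\ -1 \end{pmatrix}.$$

Since

$$\tau + nr = \begin{pmatrix} 1 & -1 \\ -1 & 6 \end{pmatrix} + n\begin{pmatrix} 1/3 & -1/3 \\ -1/3 & 4/3 \end{pmatrix} = \frac{1}{3}\begin{pmatrix} n + 3 & -n - 3 \\ -n - 3 & 4n + 18 \end{pmatrix},$$

$$(\tau + nr)^{-1} = \frac{1}{(n + 3)(n + 5)}\begin{pmatrix} 4n + 18 & n + 3 \\ n + 3 & n + 3 \end{pmatrix}.$$

So, $\mathrm{Var}(\theta_1 - \theta_2 \mid x) = \dfrac{3}{n + 3}$.

The minimum size of the sample required to reduce the posterior variance of $\theta_1 - \theta_2$ to the value 0.01 is given by

$$\mathrm{Var}(\theta_1 - \theta_2 \mid x) \le 0.01 \quad \text{or} \quad \frac{3}{n + 3} \le 0.01.$$

Hence
$n = 297$.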
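
The sample size found in Example 4.24 can be confirmed by a direct search in exact arithmetic. The sketch below is illustrative; the function name `post_var_diff` is hypothetical, and the 2 x 2 inverse is written out by hand rather than taken from a library.

```python
from fractions import Fraction as F

def post_var_diff(n):
    """Posterior variance of theta_1 - theta_2 for sample size n (Example 4.24)."""
    # Entries of the posterior precision matrix tau + n r = [[a, b], [b, c]].
    a = F(1) + F(n, 3)
    b = F(-1) - F(n, 3)
    c = F(6) + F(4 * n, 3)
    det = a * c - b * b
    # (1, -1) [[a, b], [b, c]]^{-1} (1, -1)' = (a + 2b + c) / det
    return (a + 2 * b + c) / det

n_min = 1
while post_var_diff(n_min) > F(1, 100):
    n_min += 1
```

The search stops at the first $n$ with posterior variance at most 0.01, and `post_var_diff(n)` coincides with $3/(n + 3)$ for every $n$.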
Example 4.25. (DeGroot, 1971) Let $X \sim$ BVN$(\theta, r)$, $r$ unknown, $\theta$ known. If the prior distribution $g(r)$ is Wishart$(3, \tau)$, find the sample size so that the coefficient of variation of the posterior distribution of $|r|$ will be 0.1.

Solution. On using Result 4.19 and the fact that the coefficient of variation of $|r|$ is

$$\left[\frac{k(2\alpha^* - k + 3)}{(\alpha^* - k + 2)(\alpha^* - k + 1)}\right]^{1/2},$$

the required sample size, with $k = 2$ and $\alpha^* = 3 + n$, is given by

$$\left[\frac{2(2n + 7)}{(3 + n)(n + 2)}\right]^{1/2} \le 0.1, \quad \text{that is,} \quad n^2 - 395n - 1394 \ge 0,$$

which gives $n \ge 399$ (the other root is negative).

4.8 A CONJUGATE FAMILY FOR MULTINOMIAL DISTRIBUTION 


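Result 4.22. Suppose $X \sim$ Multinomial$(n, \theta)$, where $X = (X_1, X_2, \ldots, X_k)$, $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$, $0 < \theta_i < 1$, $\sum_{i=1}^{k}\theta_i = 1$, and $\sum_{i=1}^{k} X_i = n$, with pmf

$$f(x \mid \theta) = \frac{n!}{\prod_{i=1}^{k} x_i!}\prod_{i=1}^{k}\theta_i^{x_i}.$$

The natural conjugate prior for $\theta$ is

$$g(\theta) = c\prod_{i=1}^{k-1}\theta_i^{\mu_i - 1}\left(1 - \sum_{i=1}^{k-1}\theta_i\right)^{\mu_k - 1},$$

which is a Dirichlet$(\mu, \mu_k)$ distribution, where $\mu = (\mu_1, \ldots, \mu_{k-1})$ and

$$c^{-1} = D(\mu, \mu_k) = \frac{\Gamma(\mu_1)\Gamma(\mu_2)\cdots\Gamma(\mu_k)}{\Gamma\left(\sum_{i=1}^{k-1}\mu_i + \mu_k\right)},$$

where $\mu_i > 0$; $i = 1, 2, \ldots, k$. The posterior distribution of $\theta$, given $x$, is

$$g(\theta \mid x) \propto \prod_{i=1}^{k-1}\theta_i^{x_i}\left(1 - \sum_{i=1}^{k-1}\theta_i\right)^{n - \sum_{i=1}^{k-1} x_i}\prod_{i=1}^{k-1}\theta_i^{\mu_i - 1}\left(1 - \sum_{i=1}^{k-1}\theta_i\right)^{\mu_k - 1} = \prod_{i=1}^{k-1}\theta_i^{\mu_i + x_i - 1}\left(1 - \sum_{i=1}^{k-1}\theta_i\right)^{\mu_k + n - \sum_{i=1}^{k-1} x_i - 1},$$

which is Dirichlet$\left(\mu + x,\; \mu_k + n - \sum_{i=1}^{k-1} x_i\right)$, where $x = (x_1, \ldots, x_{k-1})$.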

4.9 NUISANCE PARAMETERS 


A nuisance parameter may be defined as a parameter that is included in the probability model of the experiment at hand because it is necessary for the good fit of the model, but that is not of primary interest to the investigator. In the simplest setting of the problem, the probability model for the experiment has a parameter pair $(\theta_1, \theta_2)$ where $\theta_1$ and $\theta_2$ are real (or vector) valued, and $\theta_2$ is the nuisance parameter.

In the Bayesian approach, inference about the parameter $\theta_1$ is completely determined by the posterior distribution of $\theta_1$ obtained by integrating out the nuisance parameter $\theta_2$ from the joint posterior distribution of $\theta_1$ and $\theta_2$. Thus

$$g(\theta_1 \mid x) = \int g(\theta_1, \theta_2 \mid x)\, d\theta_2 = \int g(\theta_1 \mid \theta_2, x)\, g(\theta_2 \mid x)\, d\theta_2.$$

This suggests that the posterior distribution of the nuisance parameter acts as a weight function multiplying the conditional posterior distribution $g(\theta_1 \mid \theta_2, x)$ of the parameter of interest.
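
The weighting interpretation can be illustrated numerically with the normal model under the prior $g(\theta, r) \propto 1/r$ of Remarks 4.22 and 4.23, treating the precision $r$ as the nuisance parameter. The sketch below is only an illustration: the data, the grid over $r$, and the function names are hypothetical, and the integral is approximated by a hand-written trapezoidal rule.

```python
import numpy as np
from math import lgamma, pi, sqrt, exp, log

x = np.array([4.1, 5.3, 4.8, 6.0, 5.5])       # hypothetical data
n = len(x)
xbar = float(x.mean())
S = float(((x - xbar) ** 2).sum())             # sum of squared deviations

r_grid = np.linspace(1e-8, 60.0, 200_001)      # grid over the nuisance precision r

def cond_theta_given_r(theta, r):
    """g(theta | r, x): normal with mean xbar and precision n*r."""
    return np.sqrt(n * r / (2 * pi)) * np.exp(-0.5 * n * r * (theta - xbar) ** 2)

def marg_r(r):
    """g(r | x): Gamma((n-1)/2, S/2) density, with S/2 acting as a rate."""
    a, b = (n - 1) / 2, S / 2
    return np.exp(a * log(b) - lgamma(a) + (a - 1) * np.log(r) - b * r)

def marginal_theta(theta):
    """g(theta | x): conditional density weighted by g(r | x), integrated over r."""
    f = cond_theta_given_r(theta, r_grid) * marg_r(r_grid)
    return float(np.sum((f[1:] + f[:-1]) * np.diff(r_grid)) / 2)

def t_density(theta):
    """Closed form: t with n-1 df, location xbar, scale s/sqrt(n)."""
    nu = n - 1
    scale = sqrt(S / nu / n)
    z = (theta - xbar) / scale
    logc = lgamma((nu + 1) / 2) - lgamma(nu / 2) - 0.5 * log(nu * pi) - log(scale)
    return exp(logc - (nu + 1) / 2 * log(1 + z * z / nu))
```

Evaluating `marginal_theta` and `t_density` at the same point should give nearly identical values, which is the numerical counterpart of integrating out the nuisance parameter analytically.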


Remark 4.41. When we integrate out the nuisance parameter, we are not throwing away any information relevant to the parameters we are interested in. On the contrary, probability theory automatically takes care of all the available information about the nuisance parameter in the form of the marginal posterior pdf of $\theta_2$.

Remark 4.42. One should be cautious if the conditional posterior distribution of $\theta_1$ is sensitive to the choice of values of the nuisance parameter $\theta_2$. In particular, if the marginal posterior distribution of $\theta_2$ is concentrated over a small region about its mode $\hat{\theta}_2$, then integrating out $\theta_2$ would be nearly equivalent to assigning $\theta_2 = \hat{\theta}_2$ in $g(\theta_1 \mid \theta_2, x)$. This will mean that $g(\theta_1 \mid x) \approx g(\theta_1 \mid \hat{\theta}_2, x)$. Thus, we get the same conclusions that we should have, if the true value were known from the start.

On the other hand, if $g(\theta_2 \mid x)$ is rather flat, that is, the prior information and the sample information about the nuisance parameter $\theta_2$ are very weak, the sensitivity of $g(\theta_1 \mid \theta_2, x)$ to $\theta_2$ would suggest that more information about $\theta_2$ should be obtained so that inferences about $\theta_1$ could be improved. Therefore, implementation of the Bayesian approach to the nuisance parameter requires us to specify information about the nuisance parameter as well as about the parameters of interest.


Chapter 5 


Non-Informative Priors 


5.1 INTRODUCTION 


Quite often, the derivation of the prior distribution based on information other than the current data 
is impossible. Moreover, the statistician may be required to employ as little subjective input as possible 
so the conclusion may appear solely based on sampling model and the current data. 

A non-informative prior is one in which little new explanatory power about the unknown 
parameter is provided by intention. 

It is not easy to identify a non-informative (or objective) prior distribution which may represent 
prior ignorance or vague prior knowledge and provides solely data dependent conclusions. In fact, every
prior specification has some information or predictive implications and, therefore, ‘vague’ is not a very
useful term to represent lack of knowledge. There is no objective prior that may represent total
ignorance. From a practical point of view, the notion of vague prior may be considered as a prior which
has minimal effect, relative to the data, on the final inference/decision.

A variety of criteria are suggested for comparing methods of producing a non-informative prior. 
Simplicity, generality, and trustworthiness are the most important of them. The traditional approach to 
construct non-informative priors used by Laplace, Jeffreys, Lindley, and others, is that the method 
should be modified or adjusted in order to obtain solutions to a problem for which the method fails. 
Historically, Bayes and Laplace suggested the uniform prior distribution which was later found to 
depend upon the choice of the parameter. Jeffreys approached the problem by suggesting rules to 
construct invariant priors. Difficulties for multiparameter problems were later resolved by Jeffreys 
through ad-hoc modifications to the prior. 

A prior probability distribution that represents perfect ignorance or indifference would produce 
a posterior probability distribution that represents what one should know about the parameter $\theta$ on the
basis of the evidence (data) X alone. Such a prior is called “neutral” or non-informative prior by Royall 
(1997, page 173). According to Jeffreys (1983, page 117), non-informative priors provide a formal way 
of expressing ignorance of the value of the parameter over the permitted range. 

The efforts to construct priors that may represent the absence of knowledge have failed because 
no probability distribution can represent pure ignorance. In fact, every probability distribution represents 
a particular state of uncertain knowledge. For example, when we say that the probability of $\theta$ lying
between 1/2 and 2/3 is 1/6, or that it is any other value, we are expressing our inability to specify a
value for that probability. Philosophically speaking, there is a difference between the statements:

(i) I do not know which of two possible values of $\theta$ is true.
(ii) I have no prior evidence about which is true.

In fact, the assertion that the two values are equally probable is quite different from the previous
two statements.
There is neither one accepted definition of the term non-informative prior nor one accepted
criterion for choosing a non-informative prior. Laplace’s principle of insufficient reason to select a
uniform prior, $g(\theta) = 1$ for all $\theta \in \Theta$, has been criticized since the times of Boole and Venn due to lack
of transformation invariance property. Lindley (1956), Zellner (1977), Bernardo (1979) and Akaike (1983) 
have used information theoretic considerations to construct non-informative priors. Zellner’s procedure 
to construct non-informative priors is often considered impractical in the sense that it is difficult to 
interpret priors in relation to the posterior distribution. 

Another commonly used procedure developed by Novick (1969) depends on limiting forms for 
the conjugate prior distributions. In this procedure, the hyperparameter of the conjugate prior is made 
to approach some limiting value. For example, if $X_1, X_2, \ldots, X_n$ is a random sample from $N(\theta, r)$, precision
r known, and the conjugate prior for $\theta$ is $N(\mu, \tau)$, then the posterior distribution of $\theta$ is

$$N\left(\frac{\tau\mu + nr\bar{x}}{\tau + nr},\; \tau + nr\right),$$

which tends to $N(\bar{x}, nr)$ as the prior precision $\tau \to 0$. It should be noted that
this limiting posterior distribution cannot be derived from any proper prior distribution (see Remark
4.21).

In case both $\theta$ and r are unknown, with the Normal-Gamma prior with hyperparameters $\mu$, $\tau$, $\alpha$, and
$\beta$, we have the conditional posterior distribution of $\theta$, given r, as

$$N\left(\frac{\tau\mu + n\bar{x}}{\tau + n},\; (\tau + n)r\right),$$

which tends to $N(\bar{x}, nr)$ as $\tau \to 0$, and as $\alpha \to -1/2$, $\beta \to 0$ the marginal posterior distribution of r tends to

$$\text{Gamma}\left(\frac{n-1}{2},\; \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar{x})^2\right),$$

provided $n \ge 2$. This limiting posterior distribution results from the
improper prior $g(\theta, r) \propto 1/r$. It should be noticed that the condition $\alpha > 0$ for the hyperparameter of the
gamma prior must be violated. Such prior distributions are sometimes called ‘nil’ prior distributions. One
of the well known examples is Haldane’s nil prior, $g(\theta) \propto \theta^{-1}(1-\theta)^{-1}$, for the probability of success in a
Bernoulli trial.
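The limiting argument for the known-precision case can be checked numerically. The sketch below assumes the standard conjugate update for a normal mean with known precision, as in the example above; the summary statistics and the function name are hypothetical.

```python
def normal_posterior(mu, tau, r, xbar, n):
    """Posterior (mean, precision) of theta for N(theta, r) data and an N(mu, tau) prior.

    Both r and tau are precisions, not variances.
    """
    precision = tau + n * r
    mean = (tau * mu + n * r * xbar) / precision
    return mean, precision


xbar, n, r = 2.0, 10, 4.0  # hypothetical sample mean, sample size, known precision
mu = 0.0                   # hypothetical prior mean

# Let the prior precision shrink towards zero and watch the posterior settle.
for tau in (1.0, 1e-2, 1e-4, 1e-8):
    mean, precision = normal_posterior(mu, tau, r, xbar, n)
    print(tau, mean, precision)

limit_mean, limit_precision = normal_posterior(mu, 1e-12, r, xbar, n)
```

As the prior precision decreases, the posterior mean approaches the sample mean and the posterior precision approaches nr, whatever the prior mean.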

The fact that non-informative priors are not unique is visible from the different non-informative 
priors obtained by using Jeffreys’ approach, Zellner’s Maximal data informative approach, and the 
classical Bayes-Laplace convention for the binomial parameter $\theta$. An interesting discussion is given
by Geisser (1984). 

Remark 5.1. An improper prior may be considered as a weight function that sums (or integrates) over 
the parameter space to a value other than one. If this value is finite then the considered improper prior 
can be made to induce a proper prior by normalising the weight function. However, when the integral 
or sum is either infinite or does not exist, an improper prior remains improper and plays the role of a 
weighting function. 

Remark 5.2. The use of ignorance priors is often justified on the basis of diminishing effect of priors 
on the posterior distribution for large samples. In the case of nuisance parameters, ignorance priors 
are suitable and convenient since they are integrated out in the final stage of analysis. 

Remark 5.3. Sometimes, improper priors may lead to badly behaved posteriors and paradoxes like 
marginalisation paradox of Dawid, Stone, and Zidek (1973). 

Remark 5.4. Some Bayesians argue in favour of ignoring or violating the likelihood principle. One of 
the reasons being that the experiment induces or defines the parameter(s), from which it follows that 
the prior must be dependent on the experiment. However, sometimes priors that depend on stopping
rules can lead to perplexing alternatives. For example, consider experimentation producing a series of 
binary data. If somehow due to cost or disinterest the investigator decides to terminate the experiment, 
an immediate question comes to the mind ‘Would the Bayesian change his original prior to accommodate 
the new sampling rule? Would he adhere to the prior that depends on the experiment that was not 
conducted or would he assert that the experiment cannot be analyzed because the original scheme was 
not fulfilled?’ One such scheme could be that he terminates the experiment when either r failures or n 
trials are conducted whichever occurs first. What impartial prior would a sampling rule-dependent 
Bayesian use here? 


5.2. COMPLETE IGNORANCE 


The objective notion of probability, also called ‘logical’ or ‘necessary’, is that P(E|I) represents 
a degree of belief in the event E based on information I. Note that an individual may not choose it as 
his personal degree of belief. It is a unique objective measure of the degree to which E is logically 
obtained by the evidence. Furthermore, it does not require E to be repeatable. The objective probability 
is applicable to parameters in statistical models where posterior distributions are constructed and 
inferences are drawn using the Bayes theorem. The inferences for $\theta$, thus obtained, are logically implied
by the data and prior information. 

If we consider Bayes theorem as a device to improve the accuracy of specifying the probability
then, if any substantive prior information is available, we may instead regard the prior distribution as
a posterior distribution. Thus, it should be possible to deduce prior from posterior using Bayes theorem
in reverse to arrive at a state of no information. The objective approach, therefore, starts with the task
of finding a logically consistent and realistic representation of “complete” prior ignorance about $\theta$.

According to Poincare (1905) who was a subjectivist, complete ignorance cannot exist because 
absolute ignorance cannot provide any probability at all. Thus in Poincare’s terms, if the depth of 
ignorance of an investigator is great then there is a sense in which his beliefs approximate to some ideal
(if unattainable) state of total ignorance. 

According to Jaynes (2003), an objectivist, the natural starting point in translating a number of 
pieces of prior information uniquely into a prior probability assignment is the state of complete 
ignorance just as zero is the natural starting point in adding a column of numbers. In fact, complete 
ignorance is an ideal limiting case of real prior information just as a perfect triangle is an ideal limiting 
case of real triangles made by surveyors. 

Rev. Thomas Bayes (1763) and Laplace (1774) expressed complete ignorance by assigning uniform 
prior probability distribution for the unknown parameter(s) of the model. Laplace said “when the
probability of a simple event is unknown, we may suppose all values between 0 and 1 equally likely.”

Let us consider the case of Haldane’s (1931) nil-prior 


$$g(\theta) \propto \theta^{-1}(1-\theta)^{-1}, \quad 0 < \theta < 1, \qquad (5.1)$$

as a complete initial ignorance prior for $\theta$, where $\theta$ is the probability of success in a Bernoulli trial. This
prior was anticipated by Jeffreys, and Jaynes obtained it as a complete ignorance prior using the group
invariance approach.

If X represents the total number of successes in n iid Bernoulli trials then the posterior
distribution of $\theta$ is

$$g(\theta \mid X = r) = \frac{\theta^{r-1}(1-\theta)^{n-r-1}}{B(r, n-r)}, \quad r \ge 1,\; n - r \ge 1,\; \theta \in [0, 1]. \qquad (5.2)$$

Suppose that we observed two tosses of a coin which resulted in one success and one failure
(that is, r = 1, n - r = 1). Then $g(\theta \mid x) = 1$ for all $\theta \in [0, 1]$. If we consider the Bayes-Laplace uniform prior
obtained as a posterior distribution of $\theta$ after observing two tosses of a coin then it cannot describe
the state of complete ignorance, because it was obtained after observing one success and one failure
in the two tosses of a coin. The uniform prior thus assumes the prior information that the experiment
will yield either a success or a failure while the Haldane’s nil-prior (5.1) describes a pre-prior state of
knowledge in which we are not even sure of that. The reason is that for r = 0 or r = n in (5.2), the
posterior distribution becomes un-normalizable and is proportional to either $\theta^{-1}(1-\theta)^{n-1}$ or
$\theta^{n-1}(1-\theta)^{-1}$, respectively, and the weight is concentrated on the value $\theta = 0$ or $\theta = 1$. According to
Jeffreys, the prior should give greater weight to the end points $\theta = 0$ and 1 if the theory is to account
for the inferences made by a scientist. Thus, the choice of the appropriate prior distribution depends
on the exact prior information available.

Remark 5.5. Suppose X ~ Bin(n, $\theta$) and the prior for $\theta$ is Beta($\alpha$, $\beta$). Since the posterior distribution
of $\theta$ is Beta($\alpha + x$, $\beta + n - x$), the posterior mean as an estimate of $\theta$ is $(\alpha + x)/(\alpha + \beta + n)$. The classical
maximum likelihood estimate of $\theta$ is x/n, which may be obtained by letting $\alpha$ and $\beta$ tend to zero. Thus,
we may compromise with the classical statisticians by taking x/n as a formal Bayes estimate of $\theta$ for
the price of assuming the prior for $\theta$ to be Haldane’s improper prior.
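The limit in Remark 5.5 is easy to see numerically. In the sketch below the data (x = 7 successes in n = 20 trials) and the function name are hypothetical.

```python
def beta_binomial_posterior_mean(alpha, beta, x, n):
    """Posterior mean of theta under a Beta(alpha, beta) prior and x successes in n trials."""
    return (alpha + x) / (alpha + beta + n)


x, n = 7, 20  # hypothetical data
mle = x / n

# Symmetric Beta(a, a) priors with a shrinking towards the Haldane limit a -> 0.
means = [beta_binomial_posterior_mean(a, a, x, n) for a in (1.0, 0.5, 0.01, 1e-6)]
print(mle, means)
```

With a = 1 (the uniform prior) the estimate is (x + 1)/(n + 2); as a tends to zero it approaches x/n.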

Remark 5.6. Jaynes (2003, page 383) examines the problem of obtaining complete initial ignorance priors 
using transformation groups and obtains Haldane’s prior (5.1) to represent the state of total confusion 
or complete ignorance for the probability of success in n Bernoulli trials. 


5.3. UNIFORM PRIOR 


The suggestion of Laplace that one may take uniform distribution for the unknown parameter $\theta$
in the absence of sufficient reason for assigning unequal probabilities to the values in the parameter 
space had created a lot of discomfort for the users of Bayes theorem for inferential purposes. One of 
the major difficulties which the critics of Bayesian approach point out is that uniform priors are not 
invariant under transformation. This trouble may occur for improper as well as proper uniform priors. 
Example 5.1. (Improper uniform prior) 

Let us consider the uniform prior for the standard deviation $\sigma$ of the normal distribution. We
take $g(\sigma) = c$, $\sigma \ge 0$, and consider the transformation $\eta = \log\sigma$. This transformation makes $\eta \in (-\infty, \infty)$.
Since $\eta = h(\sigma) = \log\sigma$ gives $h^{-1}(\eta) = \sigma = e^{\eta}$, the Jacobian of transformation is

$$J = \left|\frac{\partial\sigma}{\partial\eta}\right| = e^{\eta},$$

which accounts for the rate of change, so

$$g(\eta) = g\bigl(h^{-1}(\eta)\bigr)\left|\frac{\partial}{\partial\eta}h^{-1}(\eta)\right| = g(\sigma)\left|\frac{\partial}{\partial\eta}e^{\eta}\right| = c\,e^{\eta}.$$

The resulting prior makes a strong statement about values that are a-priori more likely than others and
therefore, does not represent lack of information.


Example 5.2. (Proper Uniform prior) 
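Let $\theta$ be the probability of success in a Bernoulli trial. The Bayes-Laplace prior for $\theta$ is U(0, 1),
that is, $g(\theta) = 1$, for $\theta \in [0, 1]$. Consider the transformation $\phi = \theta/(1-\theta)$, where $\phi \in [0, \infty)$. If we write
$\phi = h(\theta) = \theta/(1-\theta)$, then

$$h^{-1}(\phi) = \phi/(1+\phi).$$

Hence

$$g(\phi) = g\bigl(h^{-1}(\phi)\bigr)\left|\frac{d}{d\phi}h^{-1}(\phi)\right| = \left|\frac{d}{d\phi}\left(\frac{\phi}{1+\phi}\right)\right| = \frac{1}{(1+\phi)^2}.$$

This result clearly shows a serious departure from the fact that no prior information about $\theta$ implies
no prior information about a simple transformation of $\theta$.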
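A quick numerical check of Example 5.2: if $\theta$ is uniform on [0, 1], the odds $\phi = \theta/(1-\theta)$ should have density $1/(1+\phi)^2$, so the probability that $\phi \le c$ must equal the probability that $\theta \le c/(1+c)$. The helper names and the cut-off c = 3 are hypothetical.

```python
def odds_density(phi):
    """Density of phi = theta / (1 - theta) implied by a U(0, 1) prior on theta."""
    return 1.0 / (1.0 + phi) ** 2


def integrate(f, a, b, steps=100000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


c = 3.0
prob_phi = integrate(odds_density, 0.0, c)  # P(phi <= c) from the transformed density
prob_theta = c / (1.0 + c)                  # P(theta <= c / (1 + c)) under U(0, 1)
print(prob_phi, prob_theta)
```

Half of the prior mass lies on odds below 1 and only a quarter on odds above 3, so the induced prior on the odds scale is far from flat.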
Remark 5.7. The uniform prior is invariant under linear transformations of $\theta$ but not under other one-to-one
transformations. Real valued functions such as $\theta^2$, $\theta^{-1}$ or $\sinh\theta$ do not have uniform densities.
If we are completely ignorant about the value of $\theta$, however, then we seem to be equally ignorant about
the value of $\theta^2$. Advocates of non-informative priors respond to this lack of invariance by arguing that
the appropriate non-informative prior for $\theta$ must depend not only on the mathematical form of parameter
space but also on the role of $\theta$ in indexing the sampling densities $f(x \mid \theta)$. The uniform prior is
appropriate for a location parameter $\theta$ but not for $\theta^2$, $\theta^{-1}$ or $\sinh\theta$, since these are no longer location
parameters. However, there are strong objections to the dependence of non-informative priors on
sampling models. For example, why should the model of ignorance about $\theta$ depend upon what
statistical experiment will eventually be carried out to provide information about $\theta$? A variety of experiments
may be feasible as in the case of tossing of a coin.

The above approach applies only to statistical problems. What do we do, if we are completely 
ignorant about a quantity which is not a statistical parameter? 
Remark 5.8. Fisher criticized Student (W.S. Gosset) for his use of uniform prior on a binomial parameter 
saying that his prior does not imply a uniform prior on the binomial parameter raised to the fifth power. 
Student replied that he has no concern about the fifth power of parameter, an irrelevant transformation. 
C.R. Rao (1987) comments “The choice of metric naturally depends on a particular problem under 
investigation, and invariance may or may not be relevant.” James Berger (1985) remarks “The major 
problem with invariance concerns the amount of invariance that can be used.” Zellner (1997) is satisfied 
with invariance of the priors with respect to relevant transformations. 
Remark 5.9. Box and Tiao (1973) and Bernardo (1979b) have argued that a non-informative prior should 
be regarded as a reference prior, i.e., a prior which is convenient to use as a standard in analyzing 
statistical data. The obvious question is “why should one choose a single prior as a standard, in
particular, uniform prior?” They say that non-informative priors are suitable reference standards because
they produce reference posterior distributions which approximately describe the kind of inferences which
we are entitled to make with relevant initial information. Their argument is based on the assumption
that little initial information should be modelled by a non-informative prior, at least as a good
approximation to some proper prior with a high degree of uncertainty.

Another argument in favour of the uniform prior is that when the data are sufficiently informative
so that the likelihood function is sharply peaked then it really does not matter what prior is used since all
reasonably smooth prior densities will lead to approximately the same posterior density. The uniform
density, in most cases, is convenient to simplify calculations of the posterior. This argument supports
the uniform prior only in those cases where it produces approximately the same conclusions as the
highly imprecise prior constructed from a sufficiently large class of prior densities. If the data are highly
informative, the uniform prior may produce reasonable inferences.
Remark 5.10. Non-informative priors have strong implications for behaviour and, therefore, should not 
be considered non-informative. Furthermore, they may not represent the prior probabilities when the 
non-informative priors are improper. The basic problem is that no precise probability distribution can 
adequately represent ignorance since complete ignorance can be properly modelled by the vacuous
probabilities and near-ignorance by near-vacuous probabilities. Walley (1991) thinks that
non-informative priors are used and defended due to some combination of the following:

(i) The problem of little or no information is important in theory and is common in practice.

(ii) A belief in the philosophy that any state of uncertainty, even complete ignorance can be 
represented by some precise probability distribution. 

(iii) Some desirable property such as invariance holds for a non-informative prior.

(iv) They do not require assessments of prior information from the user. 

(v) Objective statistical methods require objective or logical prior probabilities. 

(vi) In some important problems, inferences based on non-informative prior are numerically identical 
to classical inferences such as confidence intervals. This may give the impression that a Bayesian 
could reproduce the ‘successes’ of frequentist inferences, and therefore confirm that
non-informative priors give reasonable answers.

(vii) Adopting a uniform prior density allows us to interpret the normalised likelihood function as a
posterior density which makes the computation simple.

A variety of rules have been developed for obtaining priors to express little or no information 
regarding the parameter $\theta$. Jeffreys invoked invariance, Box and Tiao recommended priors such that
likelihoods are data translated, Akaike (1978) and Geisser (1979) formulated procedures involving the 
predictive distribution and Kullback-Leibler divergence measures, respectively. Bernardo (1979) used the 
notion of maximising entropy in the limit, whereas, Zellner (1977) maximised the Shannon’s information 
of the data relative to that of prior. 


5.4 JEFFREYS’ NON-INFORMATIVE PRIORS 
Thumb Rules for Non-Informative Priors 


Harold Jeffreys suggested a thumb rule for specifying non-informative prior for a scalar parameter
$\theta$ as follows:

Rule 1. If $\theta \in [a, b]$, where a and b are finite numbers, or $\theta \in (-\infty, \infty)$, then
$g(\theta)$ = constant.
Rule 2. If $\theta \in (0, \infty)$, assume $\log\theta$ to be uniformly distributed over the whole real line. This implies,
using transformation of variables,
$g(\theta) \propto 1/\theta$.
Remark 5.11. Rule 1 is invariant under any linear transformation $\phi = c\theta + d$, $c \ne 0$, whereas Rule 2
is invariant under the power transformation $\phi = \theta^k$, $k \ne 0$.
Jeffreys and others make extensive use of improper prior pdfs to represent “knowing little”.
According to Jeffreys the use of improper prior poses no difficulty because Renyi’s axioms and his
accompanying definitions of conditional probability allow statement of Bayes theorem even when
improper priors are employed. A prior distribution is improper if the integral $\int g(\theta)\,d\theta$ is not finite.

Jeffreys’ thumb rule for representing ignorance about the location parameter $\theta$ assuming values
from $-\infty$ to $\infty$ is $g(\theta)\,d\theta \propto d\theta$, $-\infty < \theta < \infty$, i.e., $g(\theta) \propto$ constant. Note that $g(\theta)$ is improper since

$$\int_{-\infty}^{\infty} g(\theta)\,d\theta = \infty.$$

It is interesting to note that Jeffreys used infinity to represent the probability of the sure event,
$-\infty < \theta < \infty$, rather than unity. Since

$$\frac{P(a < \theta < b)}{P(c < \theta < d)} = \frac{0}{0},$$

which is indeterminate when a, b, c, and d are any finite numbers, one cannot make any statement
about the odds that $\theta$ lies in any particular pair of finite intervals. Thus $g(\theta) \propto$ constant may be
considered as a formal representation of ignorance.


Jeffreys’ second thumb rule is concerned with the parameters having a range 0 to $\infty$, in
particular, for a scale parameter like the standard deviation $\sigma$ of a normal distribution. Since $\theta = \log\sigma$ has
a range $-\infty$ to $\infty$, the first thumb rule suggests that the prior for $\theta$ should be

$$g(\theta)\,d\theta \propto d\theta, \quad -\infty < \theta < \infty.$$

Therefore,

$$g(\sigma)\,d\sigma \propto d\sigma/\sigma, \quad 0 < \sigma < \infty. \qquad (5.3)$$

This improper distribution may be taken to represent ignorance about $\sigma$. Note that the improper prior
(5.3) is invariant to the transformation of the form $\phi = \sigma^k$. Thus,

$$d\phi/\phi \propto d\sigma/\sigma.$$

Such transformation of the standard deviation $\sigma$ is useful in inferential problems since the investigator
may like to work with the variance $\sigma^2$ or the precision $r = 1/\sigma^2$. It can easily be checked that the prior
$d\sigma/\sigma$ logically implies $d\sigma/\sigma \propto d\sigma^2/\sigma^2 \propto dr/r$. All these three representations of the scale parameter of the
normal distribution are consistent with one another and also the posterior probability statements
involving $\sigma$, $\sigma^2$, and r will be consistent.
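The invariance claimed above can be verified in one line with the change of variables formula. The following is a worked sketch, with $\phi$ used as a generic symbol for the transformed scale parameter:

```latex
% Power transformation phi = sigma^k, k != 0, applied to g(sigma) proportional to 1/sigma.
\[
\sigma = \phi^{1/k}, \qquad
\left|\frac{d\sigma}{d\phi}\right| = \frac{1}{|k|}\,\phi^{\frac{1}{k}-1},
\]
\[
g(\phi) = g(\sigma)\left|\frac{d\sigma}{d\phi}\right|
\;\propto\; \phi^{-\frac{1}{k}} \cdot \frac{1}{|k|}\,\phi^{\frac{1}{k}-1}
= \frac{1}{|k|\,\phi} \;\propto\; \frac{1}{\phi}.
\]
% k = 2 gives the variance sigma^2 and k = -2 gives the precision r = 1/sigma^2.
```

Taking k = 2 and k = -2 recovers the statements for the variance and the precision, respectively.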

The improper prior $g(\theta) \propto 1/\theta$, $\theta \in (0, \infty)$, has the following properties:

(i) $\int_{0}^{\infty} \frac{d\theta}{\theta} = \infty$, which indicates that $\infty$ represents certainty as in the earlier case,

(ii) $\int_{0}^{a} \frac{d\theta}{\theta} = \infty$,

(iii) $\int_{a}^{\infty} \frac{d\theta}{\theta} = \infty$.

Properties (ii) and (iii) together imply that

$$\frac{P(0 < \theta < a)}{P(a < \theta < \infty)} = \frac{\infty}{\infty},$$

which is indeterminate and, therefore,
nothing can be said about the odds pertaining to the events $0 < \theta < a$ and $a < \theta < \infty$.

Jeffreys considered $g(\theta) \propto$ constant when $0 < \theta < \infty$ as an unacceptable representation of ignorance about
the value of $\theta$. If we take $g(\theta) \propto$ constant as the prior then $P(\theta > c)$, where $0 < c < \infty$, is infinity, which
amounts to Jeffreys’ certainty. This should imply that $P(0 < \theta < c) = 0$, which clearly implies that we
know something about $\theta$.


Remark 5.12. It may be mentioned that $g(\theta) \propto 1/\theta$ implies $g(\theta^2) \propto 1/\theta^2$ and, in general, $g(\theta^n) \propto 1/\theta^n$.
Thus the same form of pdf is used to express ignorance about any power of $\theta$. The posterior
probabilities relating to $\theta$ and $\theta^n$ will be consistent.


Jeffreys’ Non-Informative Invariant Prior 


Jeffreys was motivated by invariance requirements and suggested a solution to provide a non- 
subjective prior. He used differential geometry methods. The requirements are invariance under 1-1 
transformations and invariance under sufficient statistic. Kass (1989) offered a heuristic explanation
based on the idea that ‘natural’ volume elements, defined in terms of Fisher’s information matrix, should
have equal probability. The one-dimensional version of Jeffreys’ prior has been justified from many different
view-points. 

Jeffreys’ prior does not reflect lack of knowledge but in the case of one parameter and under 
some regularity conditions, Jeffreys’ prior describes the type of prior knowledge which would make the 
data as posterior dominant as possible. The posterior distribution based on Jeffreys’ prior may then 
be used as a benchmark or a reference for the class of posterior distributions which may be obtained 
from other priors. 

Remark 5.13. Jeffreys’ prior is nearly uniformly distributed for the location parameter. 
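In the 1946 paper, Jeffreys proposed a formal rule for obtaining a non-informative prior as:
In case $\theta$ is a k-vector valued parameter,

$$g(\theta) \propto \sqrt{|\det I(\theta)|}, \qquad (5.4)$$

where $I(\theta)$ is a $k \times k$ Fisher’s (information) matrix whose (i, j)th element is

$$-E^{X \mid \theta}\left[\frac{\partial^2}{\partial\theta_i\,\partial\theta_j} \log f(x \mid \theta)\right], \quad i, j = 1, 2, \ldots, k. \qquad (5.5)$$

In particular, if $\theta$ is a scalar parameter, Jeffreys’ non-informative prior for $\theta$ is

$$g(\theta) \propto \sqrt{I(\theta)}. \qquad (5.6)$$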


Remark 5.14. Fisher’s information matrix is not directly related to the notion of lack of information. 
The connection comes from the role of Fisher’s matrix in asymptotic theory. Actually we should call 
Fisher’s information matrix as Fisher’s matrix. 

Remark 5.15. Jeffreys’ non-informative priors based on Fisher’s information matrix often lead to a 
family of improper priors. 
Remark 5.16. I.J. Good (1969) derived Jeffreys’ prior as the least favourable initial distribution with
respect to a logarithmic scoring rule, in the sense that it minimizes the expected score from reporting 
the true distribution. (See Bernardo and Smith (1994), page 358). 


Remark 5.17. The Beta(1/2, 1/2) prior for the Binomial parameter $\theta$, by the usual change of variable
rule

$$g(\phi) = g(\theta)\left|\frac{d\theta}{d\phi}\right|,$$

implies uniform prior for $\phi = \sin^{-1}\sqrt{\theta}$. The distribution $g(\theta)$ = Beta(1/2, 1/2) is
sometimes called the arc-sine distribution. The transformation $\phi = h(\theta)$ puts the likelihood in data
translated form, and hence a uniform prior in $\phi$, i.e., Beta(1/2, 1/2) prior for $\theta$, may be considered
as an appropriate reference prior.
Remark 5.18. One should be cautious in applying Jeffreys’ method since the expectation defining the 
information may not exist. For example, when the sampling distribution is Cauchy($\theta$, 1).


Invariance under reparametrization 


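Result 5.1. Let $\theta$ be a scalar parameter and let $\phi = h(\theta)$ be a 1-1 transformation. Then

$$I(\phi) = I(\theta)\left(\frac{d\theta}{d\phi}\right)^2. \qquad (5.7)$$

Note that computing the Jeffreys prior for the transformed parameter $\phi$ directly produces the same
answer as instead computing the Jeffreys prior for $\theta$ and subsequently performing the usual Jacobian
transformation to the $\phi$ scale. The Fisher information for $\phi$ is given by

$$I(\phi) = -E^{X \mid \phi}\left[\frac{d^2}{d\phi^2} \log f(x \mid \phi)\right].$$

Since

$$\frac{d^2}{d\phi^2} \log f(x \mid \phi) = \frac{d^2}{d\theta^2} \log f(x \mid \theta)\left(\frac{d\theta}{d\phi}\right)^2 + \frac{d}{d\theta} \log f(x \mid \theta)\left(\frac{d^2\theta}{d\phi^2}\right),$$

$\frac{d\theta}{d\phi}$ and $\frac{d^2\theta}{d\phi^2}$ are constants with respect to x, and the expectation of the score-statistic is

$$E^{X \mid \theta}\left[\frac{d}{d\theta} \log f(x \mid \theta)\right] = \int \frac{d}{d\theta} f(x \mid \theta)\,dx = \frac{d}{d\theta}\int f(x \mid \theta)\,dx = 0,$$

we have

$$I(\phi) = -E^{X \mid \theta}\left[\frac{d^2}{d\theta^2} \log f(x \mid \theta)\right]\left(\frac{d\theta}{d\phi}\right)^2 = I(\theta)\left(\frac{d\theta}{d\phi}\right)^2.$$

Hence Jeffreys’ non-informative prior for $\phi = h(\theta)$ is

$$g(\phi) = g(\theta)\left|\frac{d\theta}{d\phi}\right|. \qquad (5.8)$$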
Remark 5.19. This result shows that the priors defined on $\theta$ and $\phi$ are transformed according to the
change of variables formula. Thus it does not require any specific parameterization, which could in many
problems be rather arbitrary. In this sense, the Jeffreys’ rule for obtaining non-informative priors is quite
general.
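The invariance property can be checked numerically for a single Bernoulli observation, comparing the probability scale with the log-odds scale. In the sketch below the Fisher information is approximated by central finite differences; the function names, the step size h and the value 0.3 are hypothetical choices.

```python
import math


def fisher_information(log_pmf, param, support, h=1e-4):
    """Approximate -E[d^2/dparam^2 log f(x | param)] using central second differences."""
    info = 0.0
    for x in support:
        second = (log_pmf(x, param + h) - 2 * log_pmf(x, param) + log_pmf(x, param - h)) / h ** 2
        info -= math.exp(log_pmf(x, param)) * second
    return info


def log_pmf_theta(x, theta):
    """Bernoulli log pmf on the probability scale."""
    return x * math.log(theta) + (1 - x) * math.log(1 - theta)


def log_pmf_phi(x, phi):
    """Bernoulli log pmf on the log-odds scale, phi = log(theta / (1 - theta))."""
    return x * phi - math.log(1 + math.exp(phi))


theta = 0.3
phi = math.log(theta / (1 - theta))
dtheta_dphi = theta * (1 - theta)  # derivative of the inverse logit

info_theta = fisher_information(log_pmf_theta, theta, (0, 1))
info_phi = fisher_information(log_pmf_phi, phi, (0, 1))

prior_phi_direct = math.sqrt(info_phi)                           # Jeffreys rule applied on the phi scale
prior_phi_jacobian = math.sqrt(info_theta) * abs(dtheta_dphi)    # theta-scale prior times the Jacobian
print(prior_phi_direct, prior_phi_jacobian)
```

Both routes give the same unnormalised prior value on the log-odds scale, up to the finite-difference error.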
Remark 5.20. According to Jeffreys, non-informative priors should be chosen by convention rather
than to represent ignorance uniquely.
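Example 5.3. Let $X_1, X_2, \ldots, X_n$ be iid Bernoulli random variables with probability of success $\theta$. Since
$f(x \mid \theta) = \theta^x(1-\theta)^{1-x}$, x = 0, 1 and $\theta \in (0, 1)$, the Fisher’s information is

$$I(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2} \log f(x \mid \theta)\right] = -\sum_{x=0}^{1} \theta^x(1-\theta)^{1-x}\,\frac{\partial^2}{\partial\theta^2}\bigl(x\log\theta + (1-x)\log(1-\theta)\bigr) = \theta^{-1}(1-\theta)^{-1}.$$

Hence the Jeffreys’ prior for $\theta$ is

$$g(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}, \qquad (5.9)$$

which is a proper Beta(1/2, 1/2) distribution.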
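The calculation in Example 5.3 involves only a two-term sum, which the following sketch evaluates directly. The function names and the value 0.2 are hypothetical; the normalising constant uses the fact that B(1/2, 1/2) equals $\pi$.

```python
import math


def bernoulli_fisher_information(theta):
    """I(theta) = -sum over x in {0, 1} of f(x | theta) * d^2/dtheta^2 log f(x | theta)."""
    info = 0.0
    for x in (0, 1):
        pmf = theta ** x * (1 - theta) ** (1 - x)
        second_derivative = -x / theta ** 2 - (1 - x) / (1 - theta) ** 2
        info -= pmf * second_derivative
    return info


def jeffreys_density(theta):
    """Normalised Jeffreys prior for theta, i.e. the Beta(1/2, 1/2) density."""
    norm = math.gamma(0.5) ** 2 / math.gamma(1.0)  # B(1/2, 1/2) = pi
    return math.sqrt(bernoulli_fisher_information(theta)) / norm


theta = 0.2
info = bernoulli_fisher_information(theta)
density = jeffreys_density(theta)
print(info, density)
```

The density is U-shaped: it is smallest at one half and grows without bound towards 0 and 1, while still integrating to one.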


Example 5.4. Consider an experiment which consists of counting the number x of Bernoulli trials which
are necessary to be performed in order to observe a prespecified number r ($\ge 1$) of successes. The
probability model in this case is negative binomial having pmf

$$f(x \mid \theta) = \binom{x-1}{r-1}\theta^r(1-\theta)^{x-r}; \quad x = r, r+1, \ldots$$

The Fisher’s information is

$$I(\theta) = -\sum_{x=r}^{\infty} f(x \mid \theta)\,\frac{\partial^2}{\partial\theta^2} \log f(x \mid \theta) = \sum_{x=r}^{\infty} \binom{x-1}{r-1}\theta^r(1-\theta)^{x-r}\left[\frac{r}{\theta^2} + \frac{x-r}{(1-\theta)^2}\right] = r\,\theta^{-2}(1-\theta)^{-1}.$$

Hence the Jeffreys’ prior for $\theta$ is

$$g(\theta) \propto \theta^{-1}(1-\theta)^{-1/2}, \qquad (5.10)$$

which is an improper distribution.
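The infinite sum in Example 5.4 can be approximated by truncation and compared with the closed form $r/(\theta^2(1-\theta))$. In this sketch the truncation point x_max, the function names and the values r = 3, $\theta$ = 0.4 are hypothetical choices.

```python
import math


def negbin_pmf(x, r, theta):
    """P(X = x): number of trials x needed to observe r successes."""
    return math.comb(x - 1, r - 1) * theta ** r * (1 - theta) ** (x - r)


def negbin_fisher_information(r, theta, x_max=2000):
    """Truncated sum of f(x | theta) * [r / theta^2 + (x - r) / (1 - theta)^2] for x = r, ..., x_max."""
    return sum(
        negbin_pmf(x, r, theta) * (r / theta ** 2 + (x - r) / (1 - theta) ** 2)
        for x in range(r, x_max + 1)
    )


r, theta = 3, 0.4
numeric = negbin_fisher_information(r, theta)
closed_form = r / (theta ** 2 * (1 - theta))
print(numeric, closed_form)
```

Taking the square root of the closed form and dropping the constant factor gives the prior (5.10).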
Remark 5.21. Jeffreys’ prior depends on Fisher’s information and, therefore, depends on the chosen
sampling model. Thus, for example, if our sampling model is binomial, the Jeffreys’ prior for $\theta$ is
$g(\theta) \propto (\theta(1-\theta))^{-1/2}$, whereas, when the sampling model is negative binomial, the prior is
$g(\theta) \propto \theta^{-1}(1-\theta)^{-1/2}$. This feature amounts to violation of the likelihood principle.

Remark 5.22. The phrase ‘knowing little’ has a meaning relative to a specific experiment. For two 
different experiments, each of which can throw light on the same parameter, the choice of non- 
informative prior can be different. 

Example 5.5. Suppose that $\theta$ is the chance of success in a Bernoulli trial. There are two possible
experiments which can be conducted to draw inferences about parameter $\theta$. We can either observe the
number of successes k in n trials or else the number of trials N needed to give m successes. Suppose
we decide to conduct the first experiment in which we toss a coin twice and observe one success, i.e.,
k = 1 and n = 2. Then the Jeffreys’ prior for $\theta$ is $g_1(\theta) \propto (\theta(1-\theta))^{-1/2}$ giving posterior density
$g_1(\theta \mid k) \propto (\theta(1-\theta))^{1/2}$. On the other hand, if we conduct the second experiment in which we go on tossing
a coin until the first success is observed and suppose this requires two trials, i.e., m = 1 and N = 2.
The Jeffreys’ prior for $\theta$ is $g_2(\theta) \propto \theta^{-1}(1-\theta)^{-1/2}$ giving posterior density $g_2(\theta \mid N) \propto (1-\theta)^{1/2}$, since the
likelihood function, proportional to $\theta(1-\theta)$, is the same in both the experiments. However, the posterior inference will be different.
For example, the posterior probability that $\theta \ge 1/2$ in the first case is 1/2 and in the second case is

$$\frac{1}{B(1, 3/2)}\int_{1/2}^{1} (1-\theta)^{1/2}\,d\theta = 0.35.$$

This is disturbing because it violates the likelihood principle.
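The two posterior probabilities in Example 5.5 can be reproduced by numerical integration. The sketch assumes the posteriors are Beta(3/2, 3/2) for the binomial experiment and Beta(1, 3/2) for the negative binomial experiment, which follows from multiplying each Jeffreys prior by the common likelihood $\theta(1-\theta)$; the function names and step count are hypothetical.

```python
import math


def beta_fn(a, b):
    """Beta function B(a, b) computed from log-gamma values."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def upper_tail(a, b, lower=0.5, steps=200000):
    """P(theta >= lower) for a Beta(a, b) density, by the midpoint rule."""
    h = (1.0 - lower) / steps
    total = 0.0
    for i in range(steps):
        t = lower + (i + 0.5) * h
        total += t ** (a - 1) * (1 - t) ** (b - 1)
    return total * h / beta_fn(a, b)


p_binomial = upper_tail(1.5, 1.5)  # posterior from the binomial experiment
p_negbin = upper_tail(1.0, 1.5)    # posterior from the negative binomial experiment
print(p_binomial, p_negbin)
```

The second probability has the closed form $(1/2)^{3/2}$, which rounds to the value 0.35 quoted in the example.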
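Example 5.6. Let X ~ U(0, $\theta$). The Jeffreys’ prior for $\theta$ is

$$g(\theta) \propto \sqrt{I(\theta)},$$

where, since the support of X depends on $\theta$ and the usual regularity conditions do not hold, the
information is taken in its squared-score form,

$$I(\theta) = E\left[\left(\frac{\partial}{\partial\theta} \log f(x \mid \theta)\right)^2\right] = E\left[\left(-\frac{1}{\theta}\right)^2\right] = \frac{1}{\theta^2}.$$

Hence the Jeffreys’ prior for $\theta$ is

$$g(\theta) \propto \frac{1}{\theta}.$$

It is interesting to recall that the conjugate prior for $\theta$ is Pareto distribution with density function

$$g(\theta) = \alpha\,\theta_0^{\alpha}\,\theta^{-(\alpha+1)}\,I_{(\theta_0, \infty)}(\theta).$$

The unnormalized Pareto distribution $g(\theta) \propto \theta^{-(\alpha+1)}\,I_{(\theta_0, \infty)}(\theta)$ tends to Jeffreys’ prior as $\alpha$ and $\theta_0$ both
tend to zero. In this case, we may call it Jeffreys’ nil-prior for $\theta$.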
Example 5.7. If X ~ Pois($\theta$), then Jeffreys’ thumb rule gives prior for $\theta$, $g(\theta) \propto 1/\theta$, since $\theta \in (0, \infty)$.
In the conjugate gamma prior for $\theta$, if we let hyperparameters tend to zero, we get $g(\theta) \propto 1/\theta$, which
is once again Jeffreys’ nil-prior. However, using the formal procedure to obtain Jeffreys’ prior for $\theta$,
we obtain $g(\theta) \propto 1/\sqrt{\theta}$, which is different from the one obtained by Jeffreys’ thumb rule. The
discrepancy occurs because $\theta$ is the mean as well as the variance for the Poisson distribution.
In practice, the Bayesian inference based on either of the two priors will not be too different as
the posterior density of $\theta$ will be either Gamma$\left(\sum_{i=1}^{n} x_i,\; n\right)$ or Gamma$\left(\sum_{i=1}^{n} x_i + \frac{1}{2},\; n\right)$. The
difference between these two posterior densities is negligible for large sample sizes.
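Example 5.8. Let X follow the Weibull distribution having parameters p and $\theta$ with pdf

$$f(x \mid \theta, p) = \frac{p}{\theta}\, x^{p-1} \exp\left(-\frac{x^{p}}{\theta}\right), \quad x, p, \theta > 0.$$

Case 1. If p and $\theta$ are assumed to be a-priori independent, then

$$g(p, \theta) = g_1(p)\, g_2(\theta).$$

Since their range is 0 to $\infty$, using Jeffreys' thumb rule,

$$g_1(p) \propto 1/p \quad \text{and} \quad g_2(\theta) \propto 1/\theta.$$

Hence $g(p, \theta) \propto 1/(p\theta)$.
Case 2. In case p and $\theta$ are not assumed a-priori independent, we have

$$I(p, \theta) = -\begin{pmatrix} E\left[\dfrac{\partial^{2}}{\partial p^{2}} \log \ell(\theta, p \mid x)\right] & E\left[\dfrac{\partial^{2}}{\partial p\, \partial\theta} \log \ell(\theta, p \mid x)\right] \\[2mm] E\left[\dfrac{\partial^{2}}{\partial p\, \partial\theta} \log \ell(\theta, p \mid x)\right] & E\left[\dfrac{\partial^{2}}{\partial\theta^{2}} \log \ell(\theta, p \mid x)\right] \end{pmatrix},$$

where $\ell(\theta, p \mid x) = f(x \mid \theta, p)$. Since

$$\frac{\partial^{2}}{\partial p^{2}} \log f = -\frac{1}{p^{2}} - \frac{x^{p}}{\theta} (\log x)^{2}, \qquad \frac{\partial^{2}}{\partial p\, \partial\theta} \log f = \frac{x^{p}}{\theta^{2}} \log x, \qquad \frac{\partial^{2}}{\partial\theta^{2}} \log f = \frac{1}{\theta^{2}} - \frac{2x^{p}}{\theta^{3}},$$

and using the integral expressions

$$\Psi(x) = \int_{0}^{\infty} e^{-u} u^{x-1} \log u\, du \quad \text{and} \quad \Psi'(x) = \frac{d}{dx}\Psi(x) = \int_{0}^{\infty} u^{x-1} e^{-u} (\log u)^{2}\, du$$

(the first and second derivatives of $\Gamma(x)$; at x = 2, where $\Gamma(2) = 1$, $\Psi(2)$ coincides with the digamma function), we have

$$E\left[\frac{\partial^{2}}{\partial p\, \partial\theta} \log f\right] = \frac{1}{\theta p} \int_{0}^{\infty} u (\log u + \log\theta) e^{-u}\, du = \frac{1}{\theta p} (\log\theta + \Psi(2)), \quad \text{where } u = x^{p}/\theta,$$

$$E\left[\frac{\partial^{2}}{\partial p^{2}} \log f\right] = -\frac{1}{p^{2}} - \frac{1}{p^{2}} \int_{0}^{\infty} u (\log u + \log\theta)^{2} e^{-u}\, du = -\frac{1}{p^{2}} \left[1 + \Psi'(2) + 2\Psi(2) \log\theta + (\log\theta)^{2}\right],$$

and

$$E\left[\frac{\partial^{2}}{\partial\theta^{2}} \log f\right] = \frac{1}{\theta^{2}} - \frac{2E(x^{p})}{\theta^{3}} = -\frac{1}{\theta^{2}}.$$

Jeffreys' prior for $(p, \theta)$ is $g(p, \theta) \propto |I(p, \theta)|^{1/2}$, where

$$|I(p, \theta)| = \frac{1}{\theta^{2} p^{2}} \left[1 + \Psi'(2) + 2\Psi(2)\log\theta + (\log\theta)^{2}\right] - \frac{1}{\theta^{2} p^{2}} (\Psi(2) + \log\theta)^{2} = \frac{1}{\theta^{2} p^{2}} \left[1 + \Psi'(2) - \Psi^{2}(2)\right].$$

Hence, $g(p, \theta) \propto \dfrac{1}{\theta p}$, $\theta, p > 0$.
Example 5.9. Let $X \sim N(\mu, \sigma^{2})$.
Case 1. Jeffreys' prior for $\sigma$ when $\mu$ is known (say $\mu = 0$) is

$$g(\sigma) \propto \sqrt{I(\sigma)},$$

where

$$I(\sigma) = -E\left[\frac{\partial^{2}}{\partial\sigma^{2}} \log f(x \mid \sigma)\right] = -E\left[\frac{1}{\sigma^{2}} - \frac{3x^{2}}{\sigma^{4}}\right] = \frac{2}{\sigma^{2}} \propto \frac{1}{\sigma^{2}}.$$

Hence $g(\sigma) \propto 1/\sigma$.
Case 2. Jeffreys' prior for $\theta = (\mu, \sigma)$ when $\mu$ is also unknown is

$$g(\theta) \propto |I(\theta)|^{1/2},$$

where

$$I(\theta) = E\begin{pmatrix} \dfrac{1}{\sigma^{2}} & \dfrac{2(x-\mu)}{\sigma^{3}} \\[2mm] \dfrac{2(x-\mu)}{\sigma^{3}} & \dfrac{3(x-\mu)^{2}}{\sigma^{4}} - \dfrac{1}{\sigma^{2}} \end{pmatrix}, \qquad |I(\theta)| = 2/\sigma^{4}.$$

Hence $g(\mu, \sigma) \propto 1/\sigma^{2}$. (5.11)

However, if $\mu$ and $\sigma$ are unknown but a-priori independent, Jeffreys' prior for $(\mu, \sigma)$ is $g(\mu, \sigma) \propto 1/\sigma$.
Remark 5.23. Box and Tiao (1973) remarked that the extra factor in (5.11) arises due to ignoring independence of $\mu$ and $\sigma$.

In the Jeffreys' prior for the joint prior distribution of $(\mu, \sigma)$ of a normal distribution, an objectivist will impose a prior constraint, like a-priori independence of $\mu$ and $\sigma$, in order to be consistent or coherent.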
However, the difficulty is that it is not clear under what circumstances independence should be 


imposed. Villegas (1977, 1981) prefers the prior g(U, o ) «6 instead of g(U,o ) «Oo. 

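Example 5.10. (Example 5.9 continued) Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from $N(\mu, \sigma^{2})$. If $\mu = 0$, the Jeffreys' prior for $\sigma$ is $g(\sigma) \propto 1/\sigma$. However, if $\mu$ is unknown, then the Jeffreys' prior for $\mu$ and $\sigma$ is $g(\mu, \sigma) \propto 1/\sigma^{2}$. If we consider the posterior distribution of $\sigma$, given x, in these two cases, then

$$g(\sigma \mid x) \propto \left(\frac{1}{\sigma}\right)^{n+1} \exp\left(-\frac{\sum_{i=1}^{n} x_i^{2}}{2\sigma^{2}}\right), \quad \text{when } \mu = 0,$$

which would be such that $\sum_{i=1}^{n} x_i^{2}/\sigma^{2} \sim \chi^{2}_{n}$.

However, if $\mu$ is unknown, then

$$g(\sigma \mid x) \propto \int g(\mu, \sigma) f(x \mid \mu, \sigma)\, d\mu \propto \left(\frac{1}{\sigma}\right)^{n+2} \int \exp\left(-\frac{\sum_{i=1}^{n} (x_i - \mu)^{2}}{2\sigma^{2}}\right) d\mu$$

$$= \left(\frac{1}{\sigma}\right)^{n+2} \exp\left(-\frac{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}{2\sigma^{2}}\right) \int \exp\left(-\frac{n(\mu - \bar{x})^{2}}{2\sigma^{2}}\right) d\mu \propto \left(\frac{1}{\sigma}\right)^{n+1} \exp\left(-\frac{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}{2\sigma^{2}}\right).$$

Letting $\delta = \sum_{i=1}^{n} (x_i - \bar{x})^{2}/\sigma^{2}$, we have $\left|\dfrac{d\sigma}{d\delta}\right| = \dfrac{\sigma^{3}}{2\sum_{i=1}^{n} (x_i - \bar{x})^{2}}$ and

$$g(\delta \mid x) \propto \delta^{\frac{n}{2} - 1} \exp\left(-\frac{\delta}{2}\right),$$

which would be such that $\delta = \sum_{i=1}^{n} (x_i - \bar{x})^{2}/\sigma^{2} \sim \chi^{2}_{n}$.

This result is not acceptable since we do not lose any degrees of freedom even though the knowledge that $\mu = 0$ is ignored.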

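Example 5.11. Let $X \sim N(\theta, \sigma^{2})$ and assume that the coefficient of variation is a known constant c = 1. Since the coefficient of variation is $\sigma/\theta = 1$, we can take $X \sim N(\theta, \theta^{2})$. This form of the model is known as the multiplicative model (see Robert, 2001). The Jeffreys' prior for $\theta$ is given by

$$g(\theta) \propto \sqrt{I(\theta)} \propto \frac{1}{|\theta|}, \quad \theta \in (-\infty, \infty),$$

since

$$E\left[\frac{\partial^{2}}{\partial\theta^{2}} \log \ell(\theta \mid x)\right] = E\left[\frac{4}{\theta^{2}} - \frac{4x}{\theta^{3}} - \frac{3(x-\theta)^{2}}{\theta^{4}}\right] = -\frac{3}{\theta^{2}}.$$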

Remark 5.24. Jeffreys' non-informative invariant prior for $(\theta, r)$ of a k-variate normal distribution is

$$g(\theta, r) \propto |\det(r)|^{-(k+1)/2} \qquad (5.12)$$

under the assumption that $\theta$ and r are a-priori independent. It follows from the fact that the Jeffreys' non-informative prior for the unknown precision matrix r is

$$g(r) \propto \sqrt{\det I(r)},$$

where $|\det I(r)| \propto |\det(r)|^{-(k+1)}$.

Geisser (1965) observed that it would also result from taking a conjugate prior pdf on r in the Wishart pdf form and allowing the df in the prior pdf to be zero.
Example 5.12. Suppose $f(x \mid \theta)$ belongs to the regular one-parameter exponential family of distributions with

$$f(x \mid \theta) = \exp(A(\theta)B(x) + C(x) + D(\theta));$$

then Fisher's information is

$$I(\theta) = -E\left[\frac{\partial^{2}}{\partial\theta^{2}} \log \ell(\theta \mid x)\right] = -D''(\theta) - A''(\theta) E(B(x)).$$

However, differentiating $\int f(x \mid \theta)\, dx = 1$ with respect to $\theta$, we have

$$E(B(x)) = -D'(\theta)/A'(\theta).$$

Therefore,

$$I(\theta) = \frac{A''(\theta) D'(\theta)}{A'(\theta)} - D''(\theta).$$

Hence, the Jeffreys' prior for $\theta$ is

$$g(\theta) \propto \sqrt{\frac{A''(\theta) D'(\theta)}{A'(\theta)} - D''(\theta)}. \qquad (5.13)$$

In particular, for the Poisson distribution, $A(\theta) = \log\theta$ and $D(\theta) = -\theta$, and we have

$$g(\theta) \propto 1/\sqrt{\theta},$$

and for the binomial distribution, $A(\theta) = \log(\theta/(1-\theta))$ and $D(\theta) = \log(1-\theta)$, and we have

$$g(\theta) \propto 1/\sqrt{\theta(1-\theta)}.$$
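The information formula of Example 5.12 can be checked numerically for the two special cases. This is only a sketch: the function name `jeffreys_information` and the use of central finite differences (step `h`) are assumptions made for illustration, not part of the text.

```python
import math

def jeffreys_information(A, D, theta, h=1e-4):
    """I(theta) = A''(theta) D'(theta) / A'(theta) - D''(theta), by central differences."""
    def d1(f):
        return (f(theta + h) - f(theta - h)) / (2 * h)

    def d2(f):
        return (f(theta + h) - 2 * f(theta) + f(theta - h)) / h ** 2

    return d2(A) * d1(D) / d1(A) - d2(D)

# Poisson: A = log(theta), D = -theta; the formula should give 1/theta
info_poisson = jeffreys_information(math.log, lambda t: -t, 2.0)

# Bernoulli: A = log(theta/(1-theta)), D = log(1-theta); should give 1/(theta(1-theta))
info_bernoulli = jeffreys_information(
    lambda t: math.log(t / (1 - t)), lambda t: math.log(1 - t), 0.3)

print(info_poisson, 1 / 2.0)
print(info_bernoulli, 1 / (0.3 * 0.7))
```

Taking square roots of these values recovers the priors $1/\sqrt{\theta}$ and $1/\sqrt{\theta(1-\theta)}$ up to a constant.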
5.5 ASYMPTOTICALLY LOCALLY INVARIANT PRIORS 


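J. Hartigan (1964) developed another method for determining non-informative prior distributions using invariance techniques similar to those suggested by Jeffreys (1946). He called this new class of priors "asymptotically locally invariant (ALI) priors". These priors are locally invariant in the sense that they are invariant in the neighbourhood of some $\theta_0 \in \Theta$, and are asymptotic because the asymptotic distribution of the variance of $\dfrac{\partial}{\partial\theta} \log f(x \mid \theta)$ and $\dfrac{\partial^{2}}{\partial\theta^{2}} \log f(x \mid \theta)$ are determined up to the order of $O(1/\sqrt{n})$ by the first and second moments of these variables.


Construction of ALI Priors


Suppose $f(x \mid \theta)$ is a family of pdfs with $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$. Let us denote $\log \ell(\theta \mid x) = L(\theta)$ and let $a_{ij}$ be the (i, j)th element of the inverse of the matrix B with (i, j)th element

$$b_{ij} = E\left[\frac{\partial^{2}}{\partial\theta_i\, \partial\theta_j} L(\theta)\right], \quad i, j = 1, 2, \ldots, k.$$

The ALI prior density $g(\theta)$ is given by the solution of the equations

$$\frac{\partial}{\partial\theta_p} \log g(\theta) = -\sum_{i=1}^{k} \sum_{j=1}^{k} E\left[\left(\frac{\partial^{2}}{\partial\theta_i\, \partial\theta_p} L(\theta)\right) \left(\frac{\partial}{\partial\theta_j} L(\theta)\right)\right] a_{ij}, \quad p = 1, 2, \ldots, k, \qquad (5.14)$$

provided each solution exists and the following conditions hold:

$$\text{(i)} \quad E\left[\frac{\partial}{\partial\theta_i} L(\theta)\right] = 0; \quad i = 1, 2, \ldots, k, \qquad (5.15)$$

and

$$\text{(ii)} \quad E\left[\left(\frac{\partial}{\partial\theta_i} L(\theta)\right) \left(\frac{\partial}{\partial\theta_j} L(\theta)\right)\right] + E\left[\frac{\partial^{2}}{\partial\theta_i\, \partial\theta_j} L(\theta)\right] = 0, \quad i, j = 1, \ldots, k. \qquad (5.16)$$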
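Example 5.13. Suppose the rv X follows the binomial distribution with unknown parameter $\theta$, so that

$$f(x \mid \theta) = \binom{n}{x} \theta^{x} (1-\theta)^{n-x}; \quad x = 0, 1, \ldots, n.$$

Since

$$\frac{\partial}{\partial\theta} L(\theta) = \frac{x}{\theta} - \frac{n-x}{1-\theta} = \frac{x - n\theta}{\theta(1-\theta)},$$

$$E\left[\frac{\partial^{2}}{\partial\theta^{2}} L(\theta)\right] = -E\left[\frac{x}{\theta^{2}} + \frac{n-x}{(1-\theta)^{2}}\right] = \frac{-n}{\theta(1-\theta)}$$

and

$$E\left[\left(\frac{\partial}{\partial\theta} L(\theta)\right) \left(\frac{\partial^{2}}{\partial\theta^{2}} L(\theta)\right)\right] = -E\left[\frac{x - n\theta}{\theta(1-\theta)} \left(\frac{x}{\theta^{2}} + \frac{n-x}{(1-\theta)^{2}}\right)\right] = \frac{-n(1-2\theta)}{\theta^{2}(1-\theta)^{2}}.$$

Conditions (i) and (ii), for k = 1, are satisfied. The ALI prior for $\theta$ is given by the solution of the equation

$$\frac{\partial}{\partial\theta} \log g(\theta) = \frac{-E\left[\left(\dfrac{\partial}{\partial\theta} L(\theta)\right) \left(\dfrac{\partial^{2}}{\partial\theta^{2}} L(\theta)\right)\right]}{E\left[\dfrac{\partial^{2}}{\partial\theta^{2}} L(\theta)\right]} = \frac{1}{1-\theta} - \frac{1}{\theta}.$$

Thus the ALI prior for $\theta$ is

$$g(\theta) \propto (\theta(1-\theta))^{-1}.$$

Remark 5.25. Note that this non-informative prior is the same as the one suggested by Haldane (1931) and advocated by Novick (1965).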
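The differential equation of Example 5.13 can be verified by exact enumeration over the binomial sample space. This is a hedged sketch: the helper name `ali_log_derivative` and the choice n = 10, $\theta$ = 0.3 are illustrative assumptions.

```python
import math

def ali_log_derivative(n, theta):
    """-E[L' L''] / E[L''] for Binomial(n, theta), summing over x = 0..n."""
    e_cross = 0.0   # accumulates E[L'(theta) L''(theta)]
    e_second = 0.0  # accumulates E[L''(theta)]
    for x in range(n + 1):
        pmf = math.comb(n, x) * theta ** x * (1 - theta) ** (n - x)
        first = x / theta - (n - x) / (1 - theta)
        second = -x / theta ** 2 - (n - x) / (1 - theta) ** 2
        e_cross += pmf * first * second
        e_second += pmf * second
    return -e_cross / e_second

theta = 0.3
lhs = ali_log_derivative(10, theta)
rhs = 1 / (1 - theta) - 1 / theta  # derivative of -log(theta(1-theta))
print(lhs, rhs)
```

The two printed values agree, and the result does not depend on n, which is consistent with the ALI prior being proportional to $(\theta(1-\theta))^{-1}$.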
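Example 5.14. Suppose we are interested in obtaining the joint ALI prior for the unknown mean $\mu$ and standard deviation $\sigma$ of a normal distribution.
Here $\theta = (\mu, \sigma)$ and k = 2. Since

$$L(\mu, \sigma) = -\frac{1}{2} \log 2\pi - \log\sigma - \frac{1}{2\sigma^{2}} (x - \mu)^{2},$$

we have

$$\frac{\partial}{\partial\mu} L(\mu, \sigma) = \frac{x - \mu}{\sigma^{2}}, \qquad \frac{\partial}{\partial\sigma} L(\mu, \sigma) = -\frac{1}{\sigma} + \frac{(x-\mu)^{2}}{\sigma^{3}},$$

$$\frac{\partial^{2}}{\partial\mu^{2}} L(\mu, \sigma) = -\frac{1}{\sigma^{2}}, \qquad \frac{\partial^{2}}{\partial\sigma^{2}} L(\mu, \sigma) = \frac{1}{\sigma^{2}} - \frac{3(x-\mu)^{2}}{\sigma^{4}}, \qquad \text{and} \quad \frac{\partial^{2}}{\partial\mu\, \partial\sigma} L(\mu, \sigma) = \frac{-2}{\sigma^{3}} (x - \mu).$$

The conditions (i) and (ii) are satisfied. Furthermore, it is easy to see that

$$B = \begin{pmatrix} -1/\sigma^{2} & 0 \\ 0 & -2/\sigma^{2} \end{pmatrix} \quad \text{and} \quad B^{-1} = \begin{pmatrix} -\sigma^{2} & 0 \\ 0 & -\sigma^{2}/2 \end{pmatrix}.$$

Hartigan's ALI prior for $(\mu, \sigma)$ is given by the solution of the differential equations

$$\frac{\partial}{\partial\mu} \log g(\mu, \sigma) = E\left[\left(\frac{\partial^{2}}{\partial\mu^{2}} L(\mu, \sigma)\right) \frac{x-\mu}{\sigma^{2}}\right] \sigma^{2} + E\left[\left(\frac{\partial^{2}}{\partial\sigma\, \partial\mu} L(\mu, \sigma)\right) \left(-\frac{1}{\sigma} + \frac{(x-\mu)^{2}}{\sigma^{3}}\right)\right] \frac{\sigma^{2}}{2},$$

which vanishes because only odd central moments of x appear. Hence $g(\mu, \sigma)$ is a function of $\sigma$ independent of $\mu$. The other differential equation is

$$\frac{\partial}{\partial\sigma} \log g(\mu, \sigma) = E\left[\left(\frac{\partial^{2}}{\partial\sigma^{2}} L(\mu, \sigma)\right) \left(-\frac{1}{\sigma} + \frac{(x-\mu)^{2}}{\sigma^{3}}\right)\right] \frac{\sigma^{2}}{2} + E\left[\left(\frac{\partial^{2}}{\partial\mu\, \partial\sigma} L(\mu, \sigma)\right) \frac{x-\mu}{\sigma^{2}}\right] \sigma^{2} = -\frac{3}{\sigma} - \frac{2}{\sigma} = -\frac{5}{\sigma}.$$

Thus, we have $g(\mu, \sigma) \propto \dfrac{1}{\sigma^{5}}$.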

Remark 5.26. The posterior distribution of $\sigma$, given $x_1, x_2, \ldots, x_n$, is proportional to

$$\left(\frac{1}{\sigma}\right)^{n+4} \exp\left(\frac{-S}{\sigma^{2}}\right)$$

with the ALI prior $g(\mu, \sigma) \propto 1/\sigma^{5}$, where $S = \sum_{i=1}^{n} (x_i - \bar{x})^{2}/2$. Thus, the marginal posterior distribution of $\sigma$ is such that $\sum_{i=1}^{n} (x_i - \bar{x})^{2}/\sigma^{2} \sim \chi^{2}_{n+3}$, which is contrary to the well-known result that $\sum_{i=1}^{n} (x_i - \bar{x})^{2}/\sigma^{2} \sim \chi^{2}_{n-1}$.

Aliter. Consider the transformation $\delta = \sum_{i=1}^{n} (x_i - \bar{x})^{2}/\sigma^{2}$; then

$$g(\delta \mid x) \propto \delta^{\frac{n+3}{2} - 1} \exp(-\delta/2)$$

has the kernel of a $\chi^{2}$-distribution with (n + 3) df.
Such contradictions have resulted in restricted use of ALI priors in multiparameter problems.


5.6 MAXIMUM ENTROPY (MaxEnt) PRIORS 


Different probability distributions have different uncertainties associated with them. The 
uncertainty associated with the probability of outcomes is called probabilistic uncertainty or 
information theoretic entropy. It is important to realize that the information theoretic entropy is different 
from experimental entropy of thermodynamics used in physics. We should make note of the fact that 
experimental entropy makes no reference to any probability distribution, whereas, the information 
entropy makes no reference to thermodynamics. 

The word "entropy" suggests transformation from order to disorder. Uncertainty is reduced by obtaining more and more information. For example, if a die is not shown to us before rolling it, we have a great deal of uncertainty as we do not even know the number of faces it has. We may have many probability distributions $p = \{p_1, p_2, \ldots, p_n\}$ where n itself is arbitrary. However, if we are told that the die has six faces, we have n = 6, $p_1, p_2, \ldots, p_6 \ge 0$ and $\sum_{i=1}^{6} p_i = 1$. The set defining p is further constrained if we know that the mean number of points on the die is 4.5. The uncertainty will go on decreasing as we get more and more information about p.
The principle of maximum entropy chooses, among the probability distributions consistent with the given set of constraints, the one that has maximum uncertainty.
Definition 5.1. (Claude Shannon (1948))
Entropy (or uncertainty) of a probability distribution $p = (p_1, p_2, \ldots, p_n)$ is

$$\varepsilon_n(p) = -\sum_{i=1}^{n} p_i \log p_i. \qquad (5.17)$$

Remark 5.27. $\varepsilon_n(p)$ as a measure of uncertainty is known as Shannon's entropy.
Remark 5.28. The most rational unbiased choice of $p_1, p_2, \ldots, p_n$ is $p_1 = p_2 = \cdots = p_n = 1/n$, subject to the constraints $p_i \ge 0$; i = 1, 2, ..., n and $\sum_{i=1}^{n} p_i = 1$. According to Jaynes (2003, page 351), Shannon's entropy $\varepsilon_n(p)$ has acquired a new meaning as a fundamental measure of how close a probability distribution is to a uniform distribution. We may also note that this distribution was suggested by Laplace to represent uncertainty in the prior distribution, and he called it the principle of insufficient reason.


Properties of Shannon’s Entropy 


(1) $\varepsilon_n(p)$ is a continuous function of $p_1, p_2, \ldots, p_n$ for all $p_i$ (i = 1, 2, ..., n) lying between 0 and 1. Note that $\lim_{p_i \to 0} p_i \log p_i = 0$, and if we define $0 \cdot \log 0 = 0$ then each $p_i \log p_i$ is a continuous function of $p_i$ on [0, 1].

(2) $\varepsilon_n(p)$ does not change by the inclusion of an impossible event, since
$\varepsilon_{n+1}(p_1, p_2, \ldots, p_n, 0) = \varepsilon_n(p) - 0 \log 0 = \varepsilon_n(p)$.

(3) $\varepsilon_n(p) \ge 0$, since $-p_i \log p_i \ge 0$ for $0 < p_i \le 1$ (i = 1, ..., n).

(4) For a degenerate distribution $\varepsilon_n(p) = 0$, since, for a degenerate distribution, the outcome is known with certainty.

(5) The maximum value of $\varepsilon_n(p)$ occurs for the discrete uniform distribution $p = \left\{\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}\right\}$ and is equal to log n.
Proof. Let us use Lagrange's method to maximize entropy. Consider the Lagrangian

$$L = -\sum_{i=1}^{n} p_i \log p_i - \lambda \left(\sum_{i=1}^{n} p_i - 1\right),$$

where $\lambda$ is Lagrange's multiplier.
Differentiating with respect to $p_1, p_2, \ldots, p_n$ and equating each derivative to zero, we get
$1 + \log p_i + \lambda = 0$, i = 1, 2, ..., n,
which gives $p_1 = p_2 = \cdots = p_n = 1/n$. Thus the probability distribution p is discrete uniform. Therefore,

$$\varepsilon_n(p) = -\sum_{i=1}^{n} \frac{1}{n} \log\frac{1}{n} = \log n.$$

Remark 5.29. Note that log n is a monotonically increasing function of n and tends to $\infty$ as n tends to $\infty$. Thus the maximum of entropy will increase as the number of possible outcomes increases, thereby increasing uncertainty.
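The properties of Shannon's entropy listed above are easy to illustrate numerically. The sketch below is illustrative only: the function name `shannon_entropy` and the example distributions are assumptions, and natural logarithms are used.

```python
import math

def shannon_entropy(p):
    """Shannon entropy in nats, with the convention 0 log 0 = 0."""
    return sum(-pi * math.log(pi) for pi in p if pi > 0)

n = 6
uniform = [1 / n] * n
skewed = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]
degenerate = [1.0, 0, 0, 0, 0, 0]

print(shannon_entropy(uniform), math.log(n))  # property (5): maximum is log n
print(shannon_entropy(skewed))                # strictly below log n
print(shannon_entropy(degenerate))            # property (4): zero
print(shannon_entropy(uniform + [0]))         # property (2): unchanged by an impossible event
```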

Definition 5.2. The Kullback-Leibler measure of distance between two probability distributions $p = \{p_1, p_2, \ldots, p_n\}$ and $q = \{q_1, q_2, \ldots, q_n\}$ is defined as

$$D(p, q) = \sum_{i=1}^{n} p_i \log\frac{p_i}{q_i}. \qquad (5.18)$$

In particular, when q is the discrete uniform distribution,

$$D(p, q) = \sum_{i=1}^{n} p_i \log\frac{p_i}{1/n} = \sum_{i=1}^{n} p_i \log p_i + \log n = \log n - \varepsilon_n(p).$$

In the absence of a specified q, according to Jaynes' maximum entropy principle, we should choose p so as to maximise Shannon's entropy subject to p satisfying all given constraints.
Example 5.15. Suppose we wish to compare two Poisson distributions, p with parameter $\theta = 1$ and $q_1$ with parameter $\theta = 2$. The Kullback-Leibler distance between p and $q_1$ is

$$D(p, q_1) = \sum_{i=0}^{\infty} \frac{e^{-1}}{i!} \log\left(\frac{e^{-1}}{i!} \Big/ \frac{e^{-2} 2^{i}}{i!}\right) = 1 - \log_e 2 = 0.30685.$$

However, if $q_2$ is Pois(3), then
$D(p, q_2) = 2 - \log_e 3 = 0.90139$.
Thus, Pois(3) differs from Pois(1) much more than Pois(2) does.

Example 5.16. Suppose $p = \left\{\frac{1}{2}, \frac{1}{2}\right\}$ and $q = \left\{\frac{1}{3}, \frac{1}{3}, \frac{1}{3}\right\}$; then

$$D(p, q) = \log_e 3 - \log_e 2 = 0.40546.$$

Thus, the distribution $\left\{\frac{1}{3}, \frac{1}{3}, \frac{1}{3}\right\}$ is less informative than $\left\{\frac{1}{2}, \frac{1}{2}\right\}$.
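The Kullback-Leibler values in Examples 5.15 and 5.16 can be reproduced directly from the definition. This is a sketch under stated assumptions: the helper names are hypothetical, the Poisson support is truncated at 60 terms (the neglected tail is negligible for these parameters), and the two-point distribution is padded with a zero to match the three-point one.

```python
import math

def poisson_pmf(i, lam):
    return math.exp(-lam) * lam ** i / math.factorial(i)

def kl_divergence(p, q):
    """D(p, q) = sum p_i log(p_i / q_i); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

support = range(60)  # truncated support, an approximation
p = [poisson_pmf(i, 1.0) for i in support]
q1 = [poisson_pmf(i, 2.0) for i in support]
q2 = [poisson_pmf(i, 3.0) for i in support]

d1 = kl_divergence(p, q1)                     # compare with 1 - log 2
d2 = kl_divergence(p, q2)                     # compare with 2 - log 3
d3 = kl_divergence([0.5, 0.5, 0.0], [1 / 3] * 3)  # compare with log 3 - log 2
print(round(d1, 5), round(d2, 5), round(d3, 5))
```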


MaxEnt Prior Distribution for Discrete Random Parameter 


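Suppose the random parameter $\Theta$ takes n discrete values $\theta_1, \theta_2, \ldots, \theta_n$ with corresponding probabilities $p_1, p_2, \ldots, p_n$. Denote $h_r(\theta_i) = h_{ri}$, i = 1, 2, ..., n; r = 1, 2, ..., m. In order to obtain the MaxEnt prior distribution $p = \{p_1, p_2, \ldots, p_n\}$, we maximize $\varepsilon_n(p)$ subject to the constraints

$$\sum_{i=1}^{n} p_i = 1$$

and

$$E(h_r(\theta)) = \sum_{i=1}^{n} p_i h_{ri} = a_r; \quad r = 1, 2, \ldots, m \qquad (5.19)$$

for $p_i \ge 0$; i = 1, 2, ..., n.
Consider the Lagrangian

$$L = -\sum_{i=1}^{n} p_i \log p_i - (\lambda_0 - 1)\left(\sum_{i=1}^{n} p_i - 1\right) - \sum_{r=1}^{m} \lambda_r \left(\sum_{i=1}^{n} p_i h_{ri} - a_r\right), \qquad (5.20)$$

where $\lambda_0, \lambda_1, \ldots, \lambda_m$ are the Lagrangian multipliers. On differentiating L with respect to $p_i$ and equating it to zero, we have

$$p_i = \exp\left(-\lambda_0 - \sum_{j=1}^{m} \lambda_j h_{ji}\right), \quad i = 1, 2, \ldots, n.$$

The multipliers are determined by substituting for $p_i$ in the given constraints. In particular, $\sum_{i=1}^{n} p_i = 1$ gives

$$\exp(\lambda_0) = \sum_{i=1}^{n} \exp\left(-\sum_{j=1}^{m} \lambda_j h_{ji}\right) \qquad (5.21)$$

and $E(h_r(\theta)) = a_r$ gives

$$a_r \exp(\lambda_0) = \sum_{i=1}^{n} h_{ri} \exp\left(-\sum_{j=1}^{m} \lambda_j h_{ji}\right), \quad r = 1, 2, \ldots, m.$$

Thus,

$$a_r = \frac{\sum_{i=1}^{n} h_{ri} \exp\left(-\sum_{j=1}^{m} \lambda_j h_{ji}\right)}{\sum_{i=1}^{n} \exp\left(-\sum_{j=1}^{m} \lambda_j h_{ji}\right)}, \quad r = 1, 2, \ldots, m,$$

where $\lambda_0, \lambda_1, \ldots, \lambda_m$ may be determined by solving the above (m+1) equations.
Thus the MaxEnt prior distribution for $\theta$ is given by

$$p_i = \exp\left(-\lambda_0 - \sum_{j=1}^{m} \lambda_j h_{ji}\right), \quad i = 1, 2, \ldots, n. \qquad (5.22)$$

Furthermore, $\sum_{i=1}^{n} p_i = 1$ gives

$$p_i = \frac{\exp\left(-\sum_{j=1}^{m} \lambda_j h_{ji}\right)}{\sum_{i=1}^{n} \exp\left(-\sum_{j=1}^{m} \lambda_j h_{ji}\right)}. \qquad (5.23)$$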

Example 5.17. Suppose $\Theta$ takes values 1, 2, ..., n and let the corresponding probabilities be denoted by $p_1, p_2, \ldots, p_n$. Determine the maximum entropy prior distribution when $E(\Theta) = m$.
Solution. The Lagrangian is

$$L = -\sum_{i=1}^{n} p_i \log p_i - \lambda_0 \left(\sum_{i=1}^{n} p_i - 1\right) - \lambda_1 \left(\sum_{i=1}^{n} i p_i - m\right).$$

Since $\dfrac{\partial L}{\partial p_i} = 0$ gives $-1 - \log p_i - \lambda_0 - \lambda_1 i = 0$; i = 1, 2, ..., n,
or $p_i = a b^{i}$, where $a = e^{-(1+\lambda_0)}$ and $b = e^{-\lambda_1}$, i = 1, 2, ..., n.

The constants a and b may be determined as follows. The constraints

$$\sum_{i=1}^{n} p_i = 1 \quad \text{and} \quad \sum_{i=1}^{n} i p_i = m$$

give

$$a \sum_{i=1}^{n} b^{i} = 1 \quad \text{and} \quad a \sum_{i=1}^{n} i b^{i} = m.$$

Thus

$$\sum_{i=1}^{n} i b^{i} = m \sum_{i=1}^{n} b^{i},$$

or the equation

$$f(b) = \sum_{i=1}^{n} (i - m) b^{i} = 0 \qquad (5.24)$$

may be used to determine b.
Note that the coefficient (i − m) of $b^{i}$ is an increasing function of i. In (5.24) the coefficient (1 − m) of b is $\le 0$ and the coefficient (n − m) of $b^{n}$ is $\ge 0$, since the mean m lies between 1 and n. Hence the coefficients change sign only once and, according to Descartes' rule of signs, f(b) = 0 has only one positive root.

Furthermore f(0) = 0, $f(1) = n\left(\dfrac{n+1}{2} - m\right)$ and $f(\infty) > 0$. Thus, the positive root is less than unity if m < (n + 1)/2 and is greater than unity if m > (n + 1)/2.

Therefore, the MaxEnt prior for $\theta$ is such that

$p_i = P(\Theta = i) = a b^{i}$, i = 1, 2, ..., n,
which is in geometric progression.
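As an illustration of Example 5.17, the root of (5.24) can be found by bisection for the six-faced die with mean 4.5 mentioned at the start of this section. This is a sketch, not the author's procedure: the function name `maxent_with_mean`, the bracketing strategy and the fixed number of bisection steps are assumptions, and it requires 1 < m < n.

```python
def maxent_with_mean(n, m, iterations=200):
    """MaxEnt pmf on {1, ..., n} with mean m: p_i = a * b**i, with b the positive root of (5.24)."""
    def f_over_b(b):
        # f(b) / b = sum (i - m) b^(i-1); it has the same positive root as f(b)
        return sum((i - m) * b ** (i - 1) for i in range(1, n + 1))

    lo, hi = 0.0, 1.0
    while f_over_b(hi) < 0:      # enlarge the bracket until the sign changes
        hi *= 2
    for _ in range(iterations):  # plain bisection
        mid = (lo + hi) / 2
        if f_over_b(mid) < 0:
            lo = mid
        else:
            hi = mid
    b = (lo + hi) / 2
    weights = [b ** i for i in range(1, n + 1)]
    a = 1 / sum(weights)
    return [a * w for w in weights]

p = maxent_with_mean(6, 4.5)
mean = sum(i * pi for i, pi in enumerate(p, start=1))
print([round(pi, 4) for pi in p], round(mean, 6))
```

Since m = 4.5 exceeds (n + 1)/2 = 3.5, the root b is greater than one and the probabilities increase geometrically with i.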

Shannon's entropy can be extended to the case of a discrete random parameter $\Theta$ taking a countably infinite set of discrete values, by letting $n \to \infty$.
Example 5.18. Suppose $\Theta$ takes values 0, 1, 2, ... and we are given that $E(\Theta) = 2$. Here, m = 1, $h_1(\theta) = \theta$, $a_1 = 2$. Therefore, the maximum entropy distribution is given by

$$p_i = \frac{\exp(-\lambda_1 i)}{\sum_{i=0}^{\infty} \exp(-\lambda_1 i)}, \quad i = 0, 1, 2, \ldots.$$

Writing $\lambda = -\lambda_1$, we get

$$p_i = (1 - e^{\lambda})(e^{\lambda})^{i}; \quad i = 0, 1, 2, \ldots.$$

Thus the MaxEnt prior for $\theta$ is geometric with probability of success $1 - e^{\lambda}$. Since the mean of the distribution is given to be two, we have $e^{\lambda}/(1 - e^{\lambda}) = 2$, so that $e^{\lambda} = 2/3$. Hence the MaxEnt pmf is geometric with probability 1/3 of success.
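A quick numerical check of Example 5.18: a geometric pmf on 0, 1, 2, ... with probability of success 1/3 should sum to one and have mean two. The truncation of the support at 500 terms is an assumption made for the sketch; the neglected tail is negligible.

```python
success = 1 / 3        # probability of success from Example 5.18
ratio = 1 - success    # common ratio of the geometric progression
pmf = [success * ratio ** i for i in range(500)]  # truncated support

total = sum(pmf)
mean = sum(i * pi for i, pi in enumerate(pmf))
print(total, mean)
```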


MaxEnt Prior Distribution for a Continuous Random Parameter 


Suppose $g(\theta)$ is a prior density function over the interval (a, b). The analogue of Shannon's entropy of $g(\theta)$ may be taken as

$$\varepsilon(g) = -\int_{a}^{b} g(\theta) \log g(\theta)\, d\theta.$$

The idea is similar to the one in which sums are replaced by integrals when we shift from the discrete to the continuous case. We may extend this definition to random parameters having infinite range as well.

Result 5.2. Suppose $g(\theta)$ is a prior defined over the interval (a, b). The MaxEnt prior distribution is obtained by maximising $\varepsilon(g)$ subject to the constraints

$$\int_{a}^{b} g(\theta)\, d\theta = 1 \quad \text{and} \quad E(h_r(\theta)) = a_r; \quad r = 1, 2, \ldots, m.$$

The MaxEnt prior is given by

$$g(\theta) = \exp\left(-\lambda_0 - \sum_{i=1}^{m} \lambda_i h_i(\theta)\right), \qquad (5.25)$$

where $\lambda_0, \lambda_1, \ldots, \lambda_m$ are the Lagrange's multipliers, which may be determined by using the (m+1) equations

$$\exp(\lambda_0) = \int_{a}^{b} \exp\left(-\sum_{i=1}^{m} \lambda_i h_i(\theta)\right) d\theta$$

and

$$a_r \exp(\lambda_0) = \int_{a}^{b} h_r(\theta) \exp\left(-\sum_{i=1}^{m} \lambda_i h_i(\theta)\right) d\theta; \quad r = 1, 2, \ldots, m.$$

Example 5.19. Suppose $\Theta$ has a continuous prior $g(\theta)$ defined over the interval (a, b), where a and b are both finite. The MaxEnt prior distribution subject to the constraint

$$\int_{a}^{b} g(\theta)\, d\theta = 1 \quad \text{is} \quad g(\theta) = e^{-\lambda_0},$$

where $\lambda_0$ is given by

$$\int_{a}^{b} e^{-\lambda_0}\, d\theta = 1 \implies e^{-\lambda_0} = \frac{1}{b - a}.$$

Hence, $g(\theta) = \dfrac{1}{b - a}$, $a < \theta < b$,
which is a uniform distribution defined over the range (a, b).

Thus, the U(a, b) pdf is the most non-informative continuous prior distribution defined over the interval (a, b).
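Example 5.20. Suppose the rv $\Theta$ is defined over the whole real line and $E(\Theta) = 0$ and $E(\Theta^{2}) = 1$. The MaxEnt distribution for $\Theta$ is

$$g(\theta) = \exp[-\lambda_0 - \lambda_1\theta - \lambda_2\theta^{2}] = a \exp\{-b(\theta - c)^{2}\},$$

where

$$a = \exp\left(-\lambda_0 + \frac{\lambda_1^{2}}{4\lambda_2}\right), \quad b = \lambda_2, \quad c = -\frac{\lambda_1}{2\lambda_2}.$$

The constants a, b, and c are obtained from the three constraints. Thus,

$$\int_{-\infty}^{\infty} g(\theta)\, d\theta = 1 \quad \text{gives} \quad a\sqrt{\frac{\pi}{b}} = 1.$$

The constraint $E(\Theta) = 0$ gives $a \int_{-\infty}^{\infty} \theta \exp(-b(\theta - c)^{2})\, d\theta = 0$. Substituting $y = \theta - c$ and using the property of odd and even functions for integrals, we have c = 0, and $E(\Theta^{2}) = 1$ gives $a \int_{-\infty}^{\infty} \theta^{2} \exp(-b\theta^{2})\, d\theta = 1$. Using gamma integrals, we have

$$a = 2b\sqrt{b}/\sqrt{\pi}.$$

Hence, b = 1/2, $a = 1/\sqrt{2\pi}$, c = 0.
Therefore, the MaxEnt distribution of $\Theta$ is

$$g(\theta) = \frac{1}{\sqrt{2\pi}} \exp\left(\frac{-\theta^{2}}{2}\right).$$

This is the well known standard normal distribution.

In general, if $E(\Theta) = \mu$ and $\mathrm{Var}(\Theta) = \sigma^{2}$, then the MaxEnt distribution of $\Theta$ works out to be $N(\mu, \sigma^{2})$.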
Remark 5.30. In the Bayesian context, a proper non-informative prior distribution of the unknown 
parameter with prescribed first and second moments and nothing else, uniquely identifies the prior 
distribution to be the normal distribution. 


Remark 5.31. The form of the MaxEnt distribution $g(\theta) = \exp\left[-\lambda_0 - \sum_{i=1}^{m} \lambda_i h_i(\theta)\right]$, determined by the maximum entropy principle, belongs to the exponential family of distributions.
Example 5.21. Suppose we are interested in determining the MaxEnt distribution defined over the whole real line, such that

$$\int_{-\infty}^{\infty} g(\theta)\, d\theta = 1$$

and $E(\log(1 + \theta^{2})) = k$.
The MaxEnt distribution is

$$g(\theta) = \frac{e^{-\lambda_0}}{(1 + \theta^{2})^{\lambda_1}}.$$

In order to determine $\lambda_0$ and $\lambda_1$, we consider $\int_{-\infty}^{\infty} g(\theta)\, d\theta = 1$ to obtain $e^{-\lambda_0} B\left(\frac{1}{2}, \lambda_1 - \frac{1}{2}\right) = 1$, so that $\lambda_0 = \log B\left(\frac{1}{2}, \lambda_1 - \frac{1}{2}\right)$, and $\lambda_1$ may be determined by solving the equation

$$\int_{-\infty}^{\infty} \log(1 + \theta^{2})\, g(\theta)\, d\theta = k.$$

Hence, the MaxEnt distribution of $\Theta$ is

$$g(\theta) = \frac{1}{B\left(\frac{1}{2}, \lambda_1 - \frac{1}{2}\right)} \cdot \frac{1}{(1 + \theta^{2})^{\lambda_1}}.$$

In particular, if k is such that $\lambda_1 = 1$, then

$$g(\theta) = \frac{1}{\pi} \cdot \frac{1}{1 + \theta^{2}},$$

which is the Cauchy distribution.

Therefore, the Cauchy distribution is a MaxEnt prior distribution determined by the constraint that $E(\log(1 + \Theta^{2}))$ is equal to a given constant.
Remark 5.32. In the Bayesian context, the maximum entropy principle is often used to obtain partially informative prior distributions. Jaynes (1968) defined the entropy of a continuous prior pdf as

$$\varepsilon(g) = -\int_{\Theta} g(\theta) \log\frac{g(\theta)}{g_0(\theta)}\, d\theta, \qquad (5.26)$$

where $g_0(\theta)$ is a natural invariant non-informative prior. Most of the time Bayesians prefer to use Jeffreys' non-informative prior as $g_0$. The maximum entropy prior distribution is found to be

$$g(\theta) = \frac{g_0(\theta) \exp\left[\sum_{i=1}^{m} \lambda_i h_i(\theta)\right]}{\int_{\Theta} g_0(\theta) \exp\left[\sum_{i=1}^{m} \lambda_i h_i(\theta)\right] d\theta}, \qquad (5.27)$$

where the Lagrangian multipliers $\lambda_1, \lambda_2, \ldots, \lambda_m$ are obtained from the constraints $E(h_i(\theta)) = \mu_i$, i = 1, 2, ..., m.
Note that non-informative priors are author dependent and, therefore, there is arbitrariness in Jaynes' definition of MaxEnt priors.

Example 5.22. If $\Theta$ belongs to the real line and the non-informative prior for $\theta$ is $g_0(\theta) = 1$, the maximum entropy prior is $N(\mu, \sigma^{2})$ when the mean of the prior distribution is given to be $\mu$ and the variance is $\sigma^{2}$.
Remark 5.33. It may happen that the MaxEnt prior does not exist if the parameter space is unbounded 
and the restrictions specify fractiles of the prior distribution. 

Example 5.23. Suppose the parameter space is the whole real line and $E(\Theta) = c$. The maximum entropy prior with $g_0(\theta) = 1$ is given by

$$g(\theta) = \frac{\exp(\lambda\theta)}{\int_{-\infty}^{\infty} \exp(\lambda\theta)\, d\theta}.$$

Since the integral in the denominator does not exist, the MaxEnt prior cannot be a proper density for any $\lambda$.

Remark 5.34. One of the serious objections raised by Seidenfeld (1987) is that the maximum entropy 
distribution suffers from partitioning paradox. It may be noted that even the principle of insufficient 
reason suffers from this paradox. (See Kass and Wasserman (1996)). 

Some researchers have also criticized MaxEnt procedures on the grounds that they lack 'order invariance'. According to Zellner (1998), MaxEnt procedures are, in fact, order invariant when the same side conditions are employed throughout. Most of the time, the critics inadvertently change the number and/or nature of the side conditions to show lack of invariance (see Zellner, 2004). In particular, Zellner (1998) finds that, in their two examples, Kass and Wasserman (1996) (citing Seidenfeld, 1987) change the conditioning events or change the constraints to conclude lack of invariance to the order in which the data sets are analysed.


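Example 5.24. Suppose we know that $P(\theta \le 0.25) = 0.3$ and $P(0.25 < \theta \le 0.5) = 0.4$. We wish to find the MaxEnt prior distribution $g(\theta)$ for $\theta$, $\theta \in (0, 1)$, subject to the above two conditions and

$$\int_{0}^{1} g(\theta)\, d\theta = 1.$$

Solution. The MaxEnt prior density of $\theta$ is

$$g(\theta) = c \exp\left[\lambda_1 I_{(0, 0.25)}(\theta) + \lambda_2 I_{(0.25, 0.5)}(\theta)\right],$$

where $\lambda_1$ and $\lambda_2$ are Lagrange's multipliers.
In order to obtain $\lambda_1$ and $\lambda_2$, we have from the above two conditions

$$0.3 = P(\theta \le 0.25) = \int_{0}^{0.25} c \exp\left[\lambda_1 I_{(0, 0.25)}(\theta) + \lambda_2 I_{(0.25, 0.5)}(\theta)\right] d\theta = c \int_{0}^{0.25} e^{\lambda_1}\, d\theta = c(0.25) e^{\lambda_1} \qquad (5.28)$$

and

$$0.4 = P(0.25 < \theta \le 0.5) = \int_{0.25}^{0.5} c \exp\left[\lambda_1 I_{(0, 0.25)}(\theta) + \lambda_2 I_{(0.25, 0.5)}(\theta)\right] d\theta = c \int_{0.25}^{0.5} e^{\lambda_2}\, d\theta = c(0.25) e^{\lambda_2}. \qquad (5.29)$$

Further,

$$\int_{0}^{1} c \exp\left[\lambda_1 I_{(0, 0.25)}(\theta) + \lambda_2 I_{(0.25, 0.5)}(\theta)\right] d\theta = 1. \qquad (5.30)$$

Solving (5.28), (5.29) and (5.30), we get

$$\lambda_1 = \log 2, \quad \lambda_2 = \log\frac{8}{3}, \quad \text{and} \quad c = \frac{3}{5}.$$

Hence, the MaxEnt prior distribution of $\theta$ is

$$g(\theta) = \frac{3}{5} \exp\left[(\log 2)\, I_{(0, 0.25)}(\theta) + \left(\log\frac{8}{3}\right) I_{(0.25, 0.5)}(\theta)\right].$$

It can be easily verified that $g(\theta)$ is represented by a histogram with sub-intervals [0, 0.25), [0.25, 0.5) and [0.5, 1].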
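The histogram form of the solution to Example 5.24 can be checked in a few lines. The sketch assumes that the remaining mass 1 − 0.3 − 0.4 = 0.3 falls on (0.5, 1]; the function name `maxent_histogram` is illustrative.

```python
import math

def maxent_histogram(edges, masses):
    """Piecewise-constant MaxEnt density: each mass is spread uniformly over its bin."""
    return [m / (hi - lo) for m, lo, hi in zip(masses, edges, edges[1:])]

edges = [0.0, 0.25, 0.5, 1.0]
masses = [0.3, 0.4, 0.3]   # the last mass is implied by normalisation
heights = maxent_histogram(edges, masses)

c = heights[-1]            # density on (0.5, 1], where both indicators vanish
lambda1 = math.log(heights[0] / c)
lambda2 = math.log(heights[1] / c)
print(heights, c, lambda1, lambda2)
```

The heights 1.2, 1.6 and 0.6 correspond to $c e^{\lambda_1}$, $c e^{\lambda_2}$ and c, with $\lambda_1 = \log 2$, $\lambda_2 = \log(8/3)$ and c = 3/5.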

Remark 5.35. In case the constraints are given in terms of quantiles of $g(\theta)$, we may still use the above technique to obtain the maximum entropy distribution for $\theta$. For example, if $a_1, a_2, \ldots, a_k$ are the $p_1, p_2, \ldots, p_k$ quantiles, respectively, of $g(\theta)$, then it amounts to $P(a_i < \theta \le a_{i+1}) = P(\theta \le a_{i+1}) - P(\theta \le a_i) = p_{i+1} - p_i$; i = 0, 1, 2, ..., k, defined over the interval $(a_0, a_{k+1})$.

Remark 5.36. In case the range of $\theta$ is $(-\infty, 0)$ or $(0, \infty)$, one may encounter difficulty in obtaining the maximum entropy density as a proper density. This difficulty may be overcome if we use a truncated parameter space whenever the maximum entropy distribution does not exist for the original parameter space.
Remark 5.37. According to James Berger (1980), choosing moments restrictions is analytically easy 
but is generally inferior to the use of quantile restrictions from the view point of robustness. In fact, 
it is safe to use first few moments as restrictions. However, the use of higher moments is generally bad. 
It is generally much easier to specify prior quantiles than to specify prior moments. 


5.7 MAXIMAL DATA INFORMATION PRIORS (MDIP) 


It is often desirable to have a posterior distribution that reflects mainly the information in a given sample. To achieve this objective, it is necessary to obtain a prior distribution that adds no information to the sample information. The basic idea underlying maximal data information priors (MDIP), introduced by Zellner (1977), is to provide maximal prior average data information relative to the information in the prior distribution. Here information is measured through Shannon's entropy (or uncertainty measure), given by

$$H(f(x)) = -\int f(x) \log f(x)\, dx,$$

where f(x) is the probability density function. Zellner used the negative of H(f(x)) as the measure of the information associated with f(x).

Let $f(x \mid \theta)$ be a proper pdf (or pmf), defined to be positive for all admissible values of X, and let $\theta$ be a scalar parameter such that $\theta \in [a, b]$ with a and b finite.
Result 5.3. The information in the joint pdf $f(x, \theta)$ is the sum of the prior average information in the data and the information in the prior pdf.
Proof. The information in $f(x, \theta)$ is

$$-H(f(x, \theta)) = \int_{a}^{b} \int_{\mathcal{X}} f(x, \theta) \log f(x, \theta)\, dx\, d\theta$$

$$= \int_{a}^{b} \int_{\mathcal{X}} g(\theta) f(x \mid \theta) \left[\log g(\theta) + \log f(x \mid \theta)\right] dx\, d\theta$$

$$= \int_{a}^{b} g(\theta) \left[\int_{\mathcal{X}} f(x \mid \theta) \log f(x \mid \theta)\, dx\right] d\theta + \int_{a}^{b} \left[\int_{\mathcal{X}} f(x \mid \theta)\, dx\right] g(\theta) \log g(\theta)\, d\theta$$

$$= \int_{a}^{b} g(\theta) I(\theta)\, d\theta + \int_{a}^{b} g(\theta) \log g(\theta)\, d\theta,$$

where

$$I(\theta) = \int_{\mathcal{X}} f(x \mid \theta) \log f(x \mid \theta)\, dx \qquad (5.31)$$

is the negative of Shannon's entropy of $f(x \mid \theta)$, i.e., the information in $f(x \mid \theta)$, and $\int_{a}^{b} g(\theta) I(\theta)\, d\theta$ is the data information averaged over values of $\theta$ with prior $g(\theta)$.
Zellner (1996) justifies that the criterion functional

$D(g(\theta))$ = prior average information in the data density minus the information in the prior density (5.32)

satisfies a scientist who wishes to emphasize the information in the data density.
Definition 5.3. A maximal data information prior pdf is a proper, normalised prior pdf that maximizes

$$D = \int_{a}^{b} I(\theta) g(\theta)\, d\theta - \int_{a}^{b} g(\theta) \log g(\theta)\, d\theta, \qquad (5.33)$$

where $I(\theta)$ is the information in the pdf $f(x \mid \theta)$ given by (5.31) and

$$\int_{a}^{b} I(\theta) g(\theta)\, d\theta = E(I(\theta))$$

is the prior average data information, since the average of the data information $I(\theta)$ is taken with respect to the prior distribution $g(\theta)$.

Euler-Lagrange Equation. Let $I = \int_{a}^{b} F(x, f(x), f'(x))\, dx$, where F is a known function; then the function f(x) which maximizes or minimizes I is given by the Euler-Lagrange equation

$$\frac{\partial F}{\partial f(x)} - \frac{d}{dx}\left(\frac{\partial F}{\partial f'(x)}\right) = 0.$$

In particular, if the function F does not involve $f'(x)$, then the Euler-Lagrange equation reduces to

$$\frac{\partial F}{\partial f(x)} = 0.$$

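Example 5.25. Suppose $f(x \mid \theta) = \dfrac{1}{\sqrt{2\pi}} \exp\left(-\dfrac{1}{2}(x - \theta)^{2}\right)$. So,

$$I(\theta) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(x - \theta)^{2}\right) \left[-\frac{1}{2}\log 2\pi - \frac{1}{2}(x - \theta)^{2}\right] dx$$

$$= -\frac{1}{2}\log 2\pi - \frac{1}{2}\mathrm{Var}(x \mid \theta) = -(1 + \log 2\pi)/2,$$

which is independent of $\theta$.

Thus $\int_{-\infty}^{\infty} I(\theta) g(\theta)\, d\theta = -(1 + \log 2\pi)/2$.

D will be maximized if $\int_{-\infty}^{\infty} g(\theta) \log g(\theta)\, d\theta$ is minimized such that $\int_{-\infty}^{\infty} g(\theta)\, d\theta = 1$.

Lagrange's multiplier method requires the function

$$L = \int_{-M}^{M} g(\theta) \log g(\theta)\, d\theta + \lambda \left(\int_{-M}^{M} g(\theta)\, d\theta - 1\right) = \int_{-M}^{M} \left\{g(\theta)(\log g(\theta) + \lambda) - \frac{\lambda}{2M}\right\} d\theta, \quad M \text{ very large}.$$

The Euler-Lagrange equation $\dfrac{\partial F}{\partial g} = 0$ may be used to obtain the optimal prior $g(\theta)$ which maximizes D, where

$$F = g(\theta)(\log g(\theta) + \lambda) - \frac{\lambda}{2M},$$

and

$$\frac{\partial F}{\partial g} = 0 \quad \text{gives} \quad (\log g(\theta) + \lambda) + g(\theta)\left(\frac{1}{g(\theta)}\right) = 0,$$

or $\log g(\theta) = -\lambda - 1$.
Thus $g(\theta) = e^{-(\lambda + 1)}$ = constant (independent of $\theta$). Hence the minimal informative (or maximal data information) prior is

$g^{*}(\theta)$ = constant, $\theta \in (-M, M)$.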


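Example 5.26. Let $f(x \mid \sigma) = \dfrac{1}{\sigma\sqrt{2\pi}} \exp\left(\dfrac{-x^{2}}{2\sigma^{2}}\right)$, $\sigma > 0$. Then

$$I(\sigma) = \int_{-\infty}^{\infty} f(x \mid \sigma) \log f(x \mid \sigma)\, dx$$

$$= \left[-\frac{1}{2}\log 2\pi - \log\sigma\right] \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(\frac{-x^{2}}{2\sigma^{2}}\right) dx - \frac{1}{2\sigma^{2}} \int_{-\infty}^{\infty} \frac{x^{2}}{\sigma\sqrt{2\pi}} \exp\left(\frac{-x^{2}}{2\sigma^{2}}\right) dx$$

$$= -\frac{1}{2}\log 2\pi - \log\sigma - \frac{1}{2} = -\frac{1}{2}(\log 2\pi + 1) - \log\sigma.$$

Thus, for proper $g(\sigma)$,

$$\int_{0}^{\infty} I(\sigma) g(\sigma)\, d\sigma = -\frac{1}{2}(\log 2\pi + 1) - \int_{0}^{\infty} \log\sigma\, g(\sigma)\, d\sigma,$$

also

$$D = -\frac{1}{2}(1 + \log 2\pi) - \int_{0}^{\infty} \log\sigma\, g(\sigma)\, d\sigma - \int_{0}^{\infty} g(\sigma) \log g(\sigma)\, d\sigma.$$

Consider

$$L = \lambda\left(\int_{0}^{M} g(\sigma)\, d\sigma - 1\right) - \int_{0}^{M} \log\sigma\, g(\sigma)\, d\sigma - \int_{0}^{M} g(\sigma) \log g(\sigma)\, d\sigma, \quad M \text{ very large}.$$

Writing $F = g(\sigma)\{\lambda - \log\sigma - \log g(\sigma)\} - \dfrac{\lambda}{M}$, the equation $\dfrac{\partial F}{\partial g} = 0$ gives $\lambda - \log\sigma - 1 - \log g(\sigma) = 0$.

Hence $g(\sigma) = \dfrac{e^{\lambda - 1}}{\sigma}$,

where $\lambda$ is to be obtained by using the constraint $\int_{0}^{M} g(\sigma)\, d\sigma = 1$ for the finite range (0, M) with M large enough. Thus the MDIP for $\sigma$ is $g(\sigma) \propto 1/\sigma$.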
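Example 5.27. Let $f(x \mid \theta, \sigma)$ be $N(\theta, \sigma^{2})$. Then

$$I(\theta, \sigma) = \int_{-\infty}^{\infty} f(x \mid \theta, \sigma) \log f(x \mid \theta, \sigma)\, dx = \int_{-\infty}^{\infty} \left[-\frac{1}{2}\log 2\pi - \log\sigma - \frac{1}{2\sigma^{2}}(x - \theta)^{2}\right] f(x \mid \theta, \sigma)\, dx$$

$$= -\frac{1}{2}\log 2\pi - \log\sigma - \frac{1}{2} = -\frac{1}{2}(1 + \log 2\pi) - \log\sigma.$$

Hence

$$\int_{0}^{\infty} \int_{-\infty}^{\infty} I(\theta, \sigma) g(\theta, \sigma)\, d\theta\, d\sigma = -\frac{1}{2}(1 + \log 2\pi) - \int_{0}^{\infty} \int_{-\infty}^{\infty} \log\sigma\, g(\theta, \sigma)\, d\theta\, d\sigma.$$

So,

$$D = -\frac{1}{2}(1 + \log 2\pi) - \int_{0}^{\infty} \int_{-\infty}^{\infty} \log\sigma\, g(\theta, \sigma)\, d\theta\, d\sigma - \int_{0}^{\infty} \int_{-\infty}^{\infty} g(\theta, \sigma) \log g(\theta, \sigma)\, d\theta\, d\sigma.$$

Maximising D is the same as minimising its negative, and the Lagrangian expression is

$$L = \int_{m}^{M} \int_{-M}^{M} \left\{\log\sigma\, g(\theta, \sigma) + g(\theta, \sigma) \log g(\theta, \sigma) - \lambda g(\theta, \sigma)\right\} d\theta\, d\sigma + \lambda, \quad M \text{ very large and } m \text{ small}.$$

Writing $F = g(\theta, \sigma)\{\log\sigma + \log g(\theta, \sigma) - \lambda\} + \dfrac{\lambda}{2M(M - m)}$, the Euler-Lagrange equation

$$\frac{\partial F}{\partial g(\theta, \sigma)} = \log\sigma + \log g(\theta, \sigma) + 1 - \lambda = 0$$

gives

$$\log(\sigma g(\theta, \sigma)) = \lambda - 1,$$

or $g(\theta, \sigma) = \dfrac{1}{\sigma} e^{\lambda - 1}$.

For $g(\theta, \sigma)$ to be a proper pdf, $\int_{m}^{M} \int_{-M}^{M} g(\theta, \sigma)\, d\theta\, d\sigma = 1$, m small,

so $1 = 2M \exp(\lambda - 1)(\log M - \log m)$ and

$$\exp(\lambda - 1) = \frac{1}{2M \log M^{*}}, \quad \text{so} \quad g(\theta, \sigma) \propto \frac{1}{\sigma}, \quad \text{where } M^{*} = \frac{M}{m}.$$

We have not assumed independence of $\theta$ and $\sigma$, and the prior of $(\theta, \sigma)$ is proper on the truncated region.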
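Result 5.4. The normalised MDIP pdf $g^{*}(\theta)$ maximising D is given by
$$g^{*}(\theta) = c \exp(I(\theta)); \quad \theta \in [a, b], \qquad (5.34)$$
where c is a normalising constant.

Proof. Our aim is to maximise D such that $\int_{a}^{b} g(\theta)\, d\theta = 1$. Construct the Lagrangian function as

$$\int_{a}^{b} g(\theta) I(\theta)\, d\theta - \int_{a}^{b} g(\theta) \log g(\theta)\, d\theta + \lambda\left(\int_{a}^{b} g(\theta)\, d\theta - 1\right) = \int_{a}^{b} \left[g(\theta) I(\theta) - g(\theta) \log g(\theta) + \lambda g(\theta) - \frac{\lambda}{b - a}\right] d\theta.$$

On using the Euler-Lagrange equation

$$\frac{\partial}{\partial g(\theta)} \left[g(\theta) I(\theta) - g(\theta) \log g(\theta) + \lambda g(\theta) - \frac{\lambda}{b - a}\right] = 0,$$

or $I(\theta) - (\log g^{*}(\theta) + 1) + \lambda = 0$,

we have

$$g^{*}(\theta) = c \exp(I(\theta)),$$

where $c = \exp(\lambda - 1)$ and, for $g^{*}(\theta)$ to be proper,

$$c \int_{a}^{b} \exp(I(\theta))\, d\theta = 1.$$

Thus,

$$g^{*}(\theta) = \frac{\exp(I(\theta))}{\int_{a}^{b} \exp(I(\theta))\, d\theta}, \quad \theta \in [a, b].$$

Corollary 5.1. If $I(\theta)$ is a constant (independent of $\theta$), then

$$g^{*}(\theta) = \frac{1}{b - a}, \quad \theta \in [a, b].$$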


Example 5.28. Let $X \sim \mathrm{Bernoulli}(\theta)$; then

$$I(\theta) = \sum_{x=0}^{1} f(x \mid \theta) \log f(x \mid \theta) = \theta \log\theta + (1 - \theta) \log(1 - \theta),$$

since $f(x \mid \theta) = \theta^{x}(1 - \theta)^{1-x}$; x = 0, 1; $\theta \in [0, 1]$.

The MDIP pdf is

$$g^{*}(\theta) = c\, \theta^{\theta}(1 - \theta)^{1-\theta}; \quad \theta \in [0, 1], \qquad (5.35)$$

where c = 1.61856 is obtained by numerical integration. Note that it is symmetric about $\theta = 1/2$, which is its mean as well. It is a proper density function because $\lim_{\theta \to 0} \theta^{\theta} = 1$ and $\lim_{\theta \to 1} (1 - \theta)^{1-\theta} = 1$. This prior has finite moments of all orders. Expanding $g^{*}(\theta)$ around $\theta = 1/2$ in a Taylor's series, we have

$$g^{*}(\theta) \doteq c\left[\frac{1}{2} + \left(\theta - \frac{1}{2}\right)^{2} + \frac{5}{3}\left(\theta - \frac{1}{2}\right)^{4}\right],$$

which is easy to work with using numerical methods.
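The normalising constant of the Bernoulli MDIP can be reproduced with a simple quadrature. This is a minimal sketch: the composite Simpson rule with 20,000 sub-intervals and the helper names are assumptions, and Python's convention `0.0 ** 0.0 == 1.0` supplies the limiting values at the end points.

```python
def mdip_kernel(theta):
    """exp(I(theta)) for the Bernoulli model: theta^theta (1-theta)^(1-theta)."""
    return theta ** theta * (1 - theta) ** (1 - theta)

def simpson(f, a, b, n=20_000):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + i * h) for i in range(1, n, 2))
    s += 2 * sum(f(a + i * h) for i in range(2, n, 2))
    return s * h / 3

c = 1 / simpson(mdip_kernel, 0.0, 1.0)
mean = c * simpson(lambda t: t * mdip_kernel(t), 0.0, 1.0)
print(round(c, 4), round(mean, 4))
```

The computed constant can be compared with the value c = 1.61856 quoted in Example 5.28, and the mean with 1/2.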
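Example 5.29. Let $X \sim N(\theta, \sigma^{2})$, $\sigma^{2}$ known; then

$$I(\theta) = \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(\frac{-1}{2\sigma^{2}}(x - \theta)^{2}\right) \log\left[\frac{1}{\sigma\sqrt{2\pi}} \exp\left(\frac{-1}{2\sigma^{2}}(x - \theta)^{2}\right)\right] dx$$

$$= \log\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{1}{2\sigma^{2}} E(x - \theta)^{2} = \log\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{1}{2} = c,$$

and hence $g^{*}(\theta)$ = constant by Corollary 5.1.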


Remark 5.38. In the mathematical definition of infinite integrals, a finite range is always considered in the first place and then allowed to tend to infinity. Thus, following Jeffreys (1967), we can obtain improper MDIP pdfs in many cases by allowing their ranges to tend to infinity (see Zellner 1997, page 130). McCulloch (1992) remarked that in the past people wanted analytical results and, given the set of mathematical tools, it was actually more convenient to let the parameter space be infinite. Now most of our work is done numerically so that, in effect, we are using bounded closed intervals (sets).
Remark 5.39. According to E.T. Jaynes, Zellner’s MDIP can be interpreted as pdfs that maximize the 
entropy associated with the prior pdfs subject to the side conditions that average entropy in the data 
pdf be constant. 

Remark 5.40. Zellner (1971) used this criterion to obtain minimal information prior. 

Remark 5.41. MDIP approach allows one to derive diffuse and non-informative priors that are invariant 
with respect to relevant informations. 

Remark 5.42. If we use the criterion function

$$D = \int g(\theta)\log|F|^{1/2}\,d\theta - \int g(\theta)\log g(\theta)\,d\theta, \tag{5.36}$$

where $F$ is the Fisher's information, then maximisation of $D$ subject to $\int g(\theta)\,d\theta = 1$ gives $g^*(\theta) \propto |F|^{1/2}$, which is Jeffreys' prior defined over a finite, possibly very large, region of the parameter space.
Remark 5.43. Lindley (1956) and Bernardo (1979) considered the criterion

$$D_1 = \int_{x}\left[\int_{\Theta} g(\theta\mid x)\log g(\theta\mid x)\,d\theta\right] m(x)\,dx - \int_{\Theta} g(\theta)\log g(\theta)\,d\theta, \tag{5.37}$$

with the objective that the average information in the posterior density be greatest relative to that in the prior, where $m(x) = \int_{\Theta} f(x\mid\theta)g(\theta)\,d\theta$.


Result 5.5. The MDIP criterion function $D$ is the sum of the information in an experiment plus the information in $m(x)$ minus the information in the prior density.

Proof. Note that

$$-H(f(x,\theta)) = \int_{\Theta}\int_{x} g(\theta\mid x)m(x)\left[\log g(\theta\mid x) + \log m(x)\right] d\theta\,dx$$

$$= \int_{\Theta}\int_{x} m(x)g(\theta\mid x)\log g(\theta\mid x)\,d\theta\,dx + \int_{x} m(x)\log m(x)\,dx$$

$$= D_1 + \int_{\Theta} g(\theta)\log g(\theta)\,d\theta + \int_{x} m(x)\log m(x)\,dx. \tag{5.38}$$

However, Result 5.3 gives

$$-H(f(x,\theta)) = D + 2\int_{\Theta} g(\theta)\log g(\theta)\,d\theta. \tag{5.39}$$

Equating (5.38) and (5.39) gives

$$D = D_1 + \int_{x} m(x)\log m(x)\,dx - \int_{\Theta} g(\theta)\log g(\theta)\,d\theta. \tag{5.40}$$

Zellner (1991) argues that his criterion $D$ is broader than the Lindley-Bernardo criterion $D_1$, since it includes additional terms reflecting the information in $m(x)$ minus that in the prior $g(\theta)$.


Remark 5.44. Jaynes (1968) suggested maximisation of

$$-\int g(\theta)\log g(\theta)\,d\theta \quad\text{subject to}\quad \int g(\theta)\,d\theta = 1 \quad\text{and}\quad \int I(\theta)g(\theta)\,d\theta = c, \text{ a constant,}$$

to obtain the optimal $g(\theta)$. While this gives the same form as that of Zellner, there is a need to choose a value of $c$ in order to evaluate the Lagrange multipliers. Both of these approaches involve dependence of the prior on the form of the density, which is true of many other approaches for generating priors. Many consider this dependence to be reasonable (see Zellner, 1991).


Chapter 6 


Bayes Estimation 


The posterior distribution summarizes available probabilistic information on the parameters in the form 
of prior distribution and the sample information contained in the likelihood function. The likelihood 
principle suggests that the information on the parameter should depend only on its posterior 
distribution. Our job is to assist the investigator to extract features of interest from the posterior 
distribution. However, the information provided by the posterior distribution gets blurred by its 
complexity for large dimensions. The major problem is to find an efficient and coherent criterion to 
obtain reasonably good estimates of the parameters of interest. The criterion has to be based on the 
loss incurred in selecting an estimator. 

The overall purpose of statistical inference is to provide the statistician with an optimal decision 
based on some evaluation criterion. The evaluation criterion assesses the consequences of each decision. 

The estimation problem is a statistical decision problem in which the decision made by the statistician is his estimated value of the parameter, whose value belongs to a subset of the k-dimensional Euclidean space ($k \ge 1$).

In this chapter we shall discuss Bayes estimation under various loss functions, including some loss functions recently introduced in the literature. Generalized maximum likelihood estimation and the concept of the highest posterior density credible interval as a summary of the posterior distribution are discussed towards the end of the chapter.

There is a vast literature on Bayesian decision theory including excellent books by Berger (1985) 
and Robert (2001) and proceedings of various conferences and workshops held during the last thirty 
years. 


6.1 ELEMENTS OF BAYES DECISION THEORY 


Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a population having pdf (or pmf) $f(x\mid\theta)$. The set of all possible outcomes is the sample space, denoted by $S$, which is a subset of $R^n$. The parameter $\theta$ is unknown and the set of possible values which $\theta$ can take is known as the parameter space $\Theta$. In decision theoretic problems, decisions are sometimes called actions. A particular action is denoted by 'a' and the set of all possible actions under consideration is denoted by $\mathcal{A}$. The three main aspects of statistical inference may be categorised in terms of the number of elements in the action space $\mathcal{A}$ as follows:

(i) The action space $\mathcal{A}$ consists of two actions; $\mathcal{A} = \{a_1, a_2\}$. Decision theoretic problems in such a case are called problems in testing of hypotheses.

(ii) $\mathcal{A} = \{a_1, a_2, \ldots, a_k\}$; $k \ge 3$. These decision theoretic problems are called multiple decision problems. For example, when an experimenter is to judge which of two treatments gives greater relief on the basis of an experiment, he may

(a) decide treatment one is better;

(b) decide treatment two is better; and

(c) withhold judgement until more data are available. In this example, $k = 3$.

(iii) $\mathcal{A} = (-\infty, \infty)$. These decision theoretic problems are referred to as point estimation of a real parameter.

In particular, if action $a_1$ is taken and $\theta_1$ happens to be the true value of the parameter $\theta$, then a loss $L(\theta_1, a_1)$ is incurred. Thus, a loss function is defined for all $(\theta, a) \in \Theta \times \mathcal{A}$. For technical convenience, we shall consider only loss functions which satisfy $L(\theta, a) \ge -K > -\infty$.


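Definition 6.1. If $g(\theta)$ is the prior probability distribution of $\theta$, then the Bayesian expected loss of an action 'a' is

$$r(g,a) = E[L(\theta,a)] = \begin{cases} \displaystyle\int_{\Theta} L(\theta,a)g(\theta)\,d\theta & \text{if } \theta \text{ is continuous,} \\[2ex] \displaystyle\sum_{\theta_i \in \Theta} L(\theta_i,a)P(\theta = \theta_i) & \text{if } \theta \text{ is discrete.} \end{cases} \tag{6.1}$$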


Definition 6.2. A decision rule $\delta(x)$ is a function from $S$ into $\mathcal{A}$. If $X = x$ is an observed outcome of the experiment, then $\delta(x)$ is the action which will be taken.

Note that for a no data problem, a decision rule is simply an action.

Definition 6.3. The risk function of a decision rule $\delta(x)$ is defined by

$$R(\theta, \delta) = E(L(\theta, \delta(X))), \tag{6.2}$$

where the expectation is taken with respect to $f(x\mid\theta)$.

Remark 6.1. The risk function tells us the amount of expected loss if we use $\delta(x)$ repeatedly with varying $X$ in the problem. For a no data problem, $R(\theta, \delta) = L(\theta, \delta)$.

It is important to note that the Bayesian expected loss of an action (6.1) is a number, but the risk function is a function of $\theta$.

Remark 6.2. The risk function (or frequentist risk) is used by the followers of Neyman, Pearson and Wald to compare classical estimators and possibly obtain the best estimator. The approach is that estimators should be evaluated on their long run performance for all possible values of $\theta$. This criterion may not be appealing for a user who requires optimal results for his/her experimental data and not for the data which could be obtained by some other experiment.

Remark 6.3. According to the law of large numbers, $R(\theta, \delta)$ is approximately the average loss over iid repetitions of the same experiment.

Remark 6.4. For a decision rule $\delta$, the risk function $R(\theta, \delta)$ is a function of the parameter $\theta$. In the case of two intersecting risk functions, the comparison between the corresponding decision rules is not possible. Frequentists hope that there exists a decision rule $\delta_0$ that uniformly minimises $R(\theta, \delta)$, but such a situation rarely occurs unless the space of decision rules is restricted.
Example 6.1. Hanif and Saurav simultaneously put up either one or two fingers. Hanif wins if the sum of the digits showing is odd and Saurav wins if the sum of the digits showing is even. The winner in all cases receives in rupees the sum of the digits showing, this being paid to him by the loser. Here, each of these players has two possible choices, so that the parameter space is $\Theta = \{1, 2\}$, which is also equal to $\mathcal{A}$, the action space, in which "1" and "2" stand for the decision to put up one and two fingers, respectively.

Let us give Hanif the label "nature" and Saurav the label "statistician". Let us consider this as a game between "nature" and a "statistician", where "nature" chooses a state $\theta \in \Theta$ and the statistician, without being informed of the choice nature has made, chooses an action $a \in \mathcal{A}$, such that the statistician loses an amount $L(\theta, a)$. Note that a negative loss is interpreted as a gain. $L(\theta, a)$ is the loss to the statistician if he takes an action $a$ when $\theta$ is the true state of nature. Thus

$$L(1,1) = -2, \quad L(1,2) = 3 = L(2,1) \quad\text{and}\quad L(2,2) = -4.$$

Let us modify the game and suppose that, before the game is played, Saurav is allowed to ask Hanif how many fingers he intends to put up and that Hanif must answer honestly with probability 3/4, and dishonestly with probability 1/4. Saurav, therefore, observes a random variable $X$ (the answer Hanif gives) taking the values 1 or 2. If Hanif chooses to put up one finger (i.e. $\theta = 1$), the probability that $X = 1$ is 3/4. Similarly, if Hanif chooses to put up two fingers (i.e. $\theta = 2$), the probability that $X = 1$ is 1/4. We have, therefore, sample space $S = \{1, 2\}$. The decision functions $\delta(x)$, which map $S$ into $\mathcal{A}$, will be

$\delta_1(1) = 1, \ \delta_1(2) = 1$

$\delta_2(1) = 2, \ \delta_2(2) = 2$

$\delta_3(1) = 1, \ \delta_3(2) = 2$

$\delta_4(1) = 2, \ \delta_4(2) = 1$

Note that the decisions $\delta_1$ and $\delta_2$ ignore the value of $x$, since irrespective of whether $x$ is 1 or 2, decision $\delta_1$ is 1 and $\delta_2$ is 2. However, rule $\delta_3$ reflects the belief of Saurav that Hanif is telling the truth and decision $\delta_4$ that Hanif is not telling the truth. Both $\delta_3$ and $\delta_4$ depend on the observed data $x$.
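The risk function may be calculated as follows:

$$R(\theta=1, \delta_1) = f(x=1\mid\theta=1)L(1,1) + f(x=2\mid\theta=1)L(1,1) = \tfrac{3}{4}\times(-2) + \tfrac{1}{4}\times(-2) = -2,$$

$$R(\theta=2, \delta_1) = 3,$$

$$R(\theta=1, \delta_2) = f(x=1\mid\theta=1)L(1,2) + f(x=2\mid\theta=1)L(1,2) = \tfrac{3}{4}\times 3 + \tfrac{1}{4}\times 3 = 3,$$

$$R(\theta=2, \delta_2) = -4,$$

$$R(\theta=1, \delta_3) = f(x=1\mid\theta=1)L(1,1) + f(x=2\mid\theta=1)L(1,2) = \tfrac{3}{4}\times(-2) + \tfrac{1}{4}\times 3 = -\tfrac{3}{4},$$

$$R(\theta=2, \delta_3) = -\tfrac{9}{4},$$

$$R(\theta=1, \delta_4) = f(x=1\mid\theta=1)L(1,2) + f(x=2\mid\theta=1)L(1,1) = \tfrac{3}{4}\times 3 + \tfrac{1}{4}\times(-2) = \tfrac{7}{4},$$

and

$$R(\theta=2, \delta_4) = \tfrac{5}{4}.$$

Hence, the risk table is

    theta    delta_1    delta_2    delta_3    delta_4
      1        -2          3        -3/4        7/4
      2         3         -4        -9/4        5/4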
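The risk calculations of Example 6.1 can be reproduced mechanically. The following is a minimal sketch, taking $L(2,2) = -4$ as the rules of the game imply; the names `LOSS`, `RULES`, `likelihood` and `risk` are illustrative only.

```python
from fractions import Fraction

# Loss L(theta, a) to the statistician (Saurav); a negative loss is a gain.
LOSS = {(1, 1): -2, (1, 2): 3, (2, 1): 3, (2, 2): -4}

# The four non-randomised decision rules, written as maps x -> action.
RULES = {
    "delta1": {1: 1, 2: 1},
    "delta2": {1: 2, 2: 2},
    "delta3": {1: 1, 2: 2},
    "delta4": {1: 2, 2: 1},
}


def likelihood(x, theta):
    # f(x | theta): Hanif answers truthfully with probability 3/4.
    return Fraction(3, 4) if x == theta else Fraction(1, 4)


def risk(theta, rule):
    # R(theta, delta) = sum over x of f(x | theta) * L(theta, delta(x)).
    return sum(likelihood(x, theta) * LOSS[(theta, rule[x])] for x in (1, 2))


# For each rule, the pair (R(theta=1, delta), R(theta=2, delta)).
risk_table = {name: (risk(1, rule), risk(2, rule)) for name, rule in RULES.items()}
for name, risks in risk_table.items():
    print(name, risks)
```

Exact fractions are used so that the entries can be compared directly with the hand computation.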


In order to choose the best action with respect to the loss function $L(\theta, a)$, we must determine the extent of any information available to us about the prevailing state of nature. In case the value of the parameter $\theta$ is known, the choice of action is straightforward. However, when $\theta$ is unknown, we may have either some prior distribution $g(\theta)$ or some observed data to throw light on the value of $\theta$.

In case no observed data are available but $\theta$ is known, we may choose that action for which $L(\theta, a)$ is as small as possible. However, if $\theta$ is unknown but the prior distribution $g(\theta)$ is available, which reflects the relative importance of the possible states, we may assess possible actions in terms of the prior expected losses and follow the conditional Bayes principle to choose the best action.

Definition 6.4. The Bayes risk of a decision rule $\delta$, with respect to prior distribution $g(\theta)$, is defined as

$$r(g, \delta) = E[R(\theta, \delta)], \tag{6.3}$$

the expectation being taken with respect to the prior distribution $g(\theta)$.

Definition 6.5. (Bayes Risk Principle) A decision rule $\delta_1$ is preferred to a rule $\delta_2$ if

$$r(g, \delta_1) < r(g, \delta_2). \tag{6.4}$$

The Bayes risk principle provides the Bayes decision rule which is optimum in the sense that it minimizes the posterior expected loss, which is equivalent to the Bayes rule which minimizes the Bayes risk.

Definition 6.6. A decision rule which minimizes $r(g, \delta)$ is called a Bayes rule, denoted by $\delta^g$. The quantity

$$r(g) = r(g, \delta^g) \tag{6.5}$$

is called the Bayes risk of $g$.

In a no data problem, an action $a \in \mathcal{A}$ which minimizes the Bayesian expected loss is called a Bayes action and is denoted by $a^g$. The Bayes risk principle, in the no data problem, is known as the Conditional Bayes Principle.
Example 6.2. Shyam wishes to decide whether or not to buy the shares of some risky company. If he 
buys the shares, they can be redeemed at maturity for a gain of Rs. 1000. However, there is a possibility 
of a default on the share in which case the original investment of Rs. 5000 could be lost. In case, Shyam 
invests his money in a fixed deposit, he will be guaranteed a net gain of Rs. 600 over the same period. 
He estimates the probability of a default to be 0.1. 

In this example, there are two actions $a_1$ and $a_2$, where $a_1$ stands for buying the shares and $a_2$ for not buying the shares. Similarly, there are two states of nature $\theta_1$ and $\theta_2$, where $\theta_1$ denotes "no default occurs" and $\theta_2$ the state "a default occurs". The loss function is given by the following table:

                          a_1 (buy)    a_2 (do not buy)
    theta_1 (no default)    -1000          -600
    theta_2 (default)        5000          -600

Remark 6.5. In this case $\Theta = \{\theta_1, \theta_2\}$ and $\mathcal{A} = \{a_1, a_2\}$ are finite and, therefore, the loss function is represented by the above table and is called a loss matrix. In such a matrix, actions are placed on the top of the table and states of nature ($\theta$ values) along the side. The prior information that Shyam estimates the probability of a default to be 0.1 may be expressed as a prior probability distribution over the parameter space $\Theta$, such that $P(\theta = \theta_1) = 0.9$ and $P(\theta = \theta_2) = 0.1$.
In our example, we find that the Bayesian expected loss of action $a_1$ is

$$r(g, a_1) = E(L(\theta, a_1)) = L(\theta_1, a_1)P(\theta = \theta_1) + L(\theta_2, a_1)P(\theta = \theta_2)$$

$$= -1000 \times 0.9 + 5000 \times 0.1 = -900 + 500 = -400.$$

Similarly, $r(g, a_2) = -600$.

Since $a_2$ has the smaller Bayesian expected loss, it is the Bayes action.

According to the conditional Bayes principle, the shares of the risky company should not be purchased.
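The comparison of the two actions in Example 6.2 amounts to a prior-weighted sum of losses. A small sketch follows; the dictionary layout and the names `LOSS`, `PRIOR` and `bayesian_expected_loss` are assumptions made for illustration, while the numbers are those given in the example.

```python
from fractions import Fraction

# Losses in rupees (gains are negative losses), keyed by (state, action).
LOSS = {
    ("no_default", "buy"): -1000,
    ("default", "buy"): 5000,
    ("no_default", "do_not_buy"): -600,
    ("default", "do_not_buy"): -600,
}

# Prior probabilities of the two states of nature.
PRIOR = {"no_default": Fraction(9, 10), "default": Fraction(1, 10)}


def bayesian_expected_loss(action):
    # r(g, a) = sum over states of L(theta, a) * P(theta).
    return sum(LOSS[(state, action)] * p for state, p in PRIOR.items())


expected = {a: bayesian_expected_loss(a) for a in ("buy", "do_not_buy")}
bayes_action = min(expected, key=expected.get)
print(expected, bayes_action)
```

Changing the default probability in `PRIOR` shows how sensitive the Bayes action is to the prior: buying becomes preferable once the default probability drops below 1/15.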
Remark 6.6. Note that the above example represents a no data situation, decision rules being simply 
actions and consequently risk function is simply the loss function. Therefore, the Bayes risk is simply 
the Bayesian expected loss and the solution using Bayes risk principle should be same. 
Remark 6.7. The above example points out that the Bayes risk principle and the conditional Bayes 
principle give the same answer. It is interesting to note that the prior g used in the conditional Bayes 
principle is a data modified version of the original prior distribution g used in the Bayes risk principle. 
Definition 6.7. The posterior expected loss of an action 'a', when the posterior distribution is $g(\theta\mid x)$, is

$$\rho(g(\theta\mid x), a) = E(L(\theta, a)), \tag{6.6}$$

where the expectation is taken with respect to the posterior distribution $g(\theta\mid x)$.

Definition 6.8. A posterior Bayes action $\delta^g(x)$ is any action $a \in \mathcal{A}$ which minimizes $\rho(g(\theta\mid x), a)$, or equivalently, which minimizes $E[L(\theta, a)f(x\mid\theta)]$, the expectation being taken with respect to the prior density $g(\theta)$.
Result 6.1. A Bayes rule $\delta^g$ (i.e. a rule minimizing $r(g, \delta)$) can be found by choosing, for each $x$ such that $m(x) > 0$, an action which minimizes the posterior expected loss. The rule can be defined arbitrarily when $m(x) = 0$.

Recall that $m(x) = \int_{\Theta} f(x\mid\theta)g(\theta)\,d\theta$ appears as a normalising constant in the definition of the posterior distribution.

Remark 6.8. If $r(g, \delta) = \infty$ for all $\delta$, then any decision rule is a Bayes rule.

Remark 6.9. There can be several minima of $r(g, \delta)$ and, therefore, several Bayes actions.
Result 6.2. For a decision rule $\delta$,

$$r(g, \delta) = E[\rho(g(\theta\mid x), \delta(x))], \tag{6.7}$$

where the expectation is taken with respect to the marginal density $m(x)$ of $X$, for all values of $x$ for which $m(x) > 0$.

Proof. On using (6.3), we have

$$r(g, \delta) = \int_{\Theta} R(\theta, \delta)g(\theta)\,d\theta = \int_{\Theta}\int_{S} L(\theta, \delta(x))f(x\mid\theta)g(\theta)\,dx\,d\theta.$$

Since $L(\theta, \delta)$, $f(x\mid\theta)$ and $g(\theta)$ are all finite, we may invoke Fubini's theorem to interchange the order of integration to obtain

$$r(g, \delta) = \int_{S}\int_{\Theta} L(\theta, \delta(x))f(x\mid\theta)g(\theta)\,d\theta\,dx$$

$$= \int_{S}\int_{\Theta} L(\theta, \delta(x))g(\theta\mid x)m(x)\,d\theta\,dx$$

$$= \int_{S} \rho(g(\theta\mid x), \delta(x))\,m(x)\,dx \quad \text{(by Definition 6.7)}.$$

In case $\theta$ and/or $x$ are discrete, the integrals may be replaced by summations.
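Result 6.2 can be illustrated on the finger game of Example 6.1. The sketch below assumes a prior that puts probability 1/2 on each state; this prior is not part of the example and is chosen only for illustration. It compares the Bayes risk obtained by minimising over all four decision rules with the one obtained by minimising the posterior expected loss separately for each $x$.

```python
from fractions import Fraction
from itertools import product

LOSS = {(1, 1): -2, (1, 2): 3, (2, 1): 3, (2, 2): -4}   # L(theta, a)
PRIOR = {1: Fraction(1, 2), 2: Fraction(1, 2)}           # hypothetical prior g(theta)
STATES = ACTIONS = OUTCOMES = (1, 2)


def likelihood(x, theta):
    # f(x | theta): a truthful answer has probability 3/4.
    return Fraction(3, 4) if x == theta else Fraction(1, 4)


def marginal(x):
    # m(x) = sum over theta of f(x | theta) g(theta).
    return sum(likelihood(x, th) * PRIOR[th] for th in STATES)


def posterior_expected_loss(a, x):
    # rho(g(theta | x), a), with g(theta | x) = f(x | theta) g(theta) / m(x).
    joint = sum(LOSS[(th, a)] * likelihood(x, th) * PRIOR[th] for th in STATES)
    return joint / marginal(x)


def bayes_risk(rule):
    # r(g, delta) = E_g[ R(theta, delta) ].
    return sum(
        PRIOR[th] * sum(likelihood(x, th) * LOSS[(th, rule[x])] for x in OUTCOMES)
        for th in STATES
    )


# Pointwise minimisation of the posterior expected loss for each observed x.
bayes_rule = {
    x: min(ACTIONS, key=lambda a: posterior_expected_loss(a, x)) for x in OUTCOMES
}
extensive_risk = sum(
    marginal(x) * posterior_expected_loss(bayes_rule[x], x) for x in OUTCOMES
)

# Minimisation of the Bayes risk over all decision rules.
all_rules = [dict(zip(OUTCOMES, acts)) for acts in product(ACTIONS, repeat=2)]
normal_risk = min(bayes_risk(rule) for rule in all_rules)

print(bayes_rule, extensive_risk, normal_risk)
```

Under this prior both routes select the rule that trusts Hanif's answer, and the two minimised risks coincide.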
Remark 6.10. The overall minimization of $r(g, \delta)$ has been called the normal form of Bayesian analysis (NFA), while minimization of $\rho(g(\theta\mid x), a)$ has been called the extensive form of Bayesian analysis (EFA) by Raiffa and Schlaifer (1961).

In practice, we obtain the Bayes rule $\delta^g(x)$ by minimizing $\rho(g(\theta\mid x), a)$ with respect to $a$. Essentially, NFA provides a Bayesian solution to a decision problem when it is expressed in the sampling theory form in which the distribution over $S$ is paramount. If the loss function is bounded and the prior $g(\theta)$ is proper, then NFA and EFA give the same result. NFA considers the situation before the data are available, whereas in EFA only the decision for the particular $x$ observed is contemplated.

Remark 6.11. Result 6.2 provides a procedure for determination of the Bayes estimates. Furthermore, from the Bayesian perspective, based on the conditional approach, only the posterior expected loss is important. It is a waste of information to average over all possible values of $x \in S$ when the actual observed values of the sample are known. The Bayesian approach works conditional upon the actual observations and also incorporates the probabilistic information about the parameter $\theta$ through the likelihood function. The equivalence established in Result 6.2 provides a connection between the classical results of decision theory and the Bayesian approach, which is based on the posterior distribution. It also explains why Bayesian estimators play an important role in classical optimality criteria.
The Bayesian approach to statistical inference comprises the following steps:
(1) Obtain the likelihood function which provides the probabilistic information about the unknown
parameter $\theta$ from the observed data.
(2) Identify or select the prior distribution $g(\theta)$ which expresses the knowledge about $\theta$ before
observing the data.
(3) Use Bayes theorem to derive the posterior density $g(\theta\mid x)$.
(4) Derive appropriate inference statements from the posterior distribution. This may include specific
inferences such as point estimates, interval estimates, or probabilities of hypotheses.
(5) In case the client wants to reach a decision, a loss function or utility function is constructed to
reflect the amount of loss incurred for various possible decisions. The Bayes risk principle is used
to reach the optimum decision.

Example 6.3. (Maddala, 1977) Let $x = (x_1, x_2)$ be two independent observations from a $N(\theta, 1)$ population. Suppose we wish to obtain the Bayes estimate of $\theta$ as a linear function of $x_1$ and $x_2$, that is, $\delta_{c_1,c_2}(x) = c_1x_1 + c_2x_2$, when the loss function is $L(\theta, \delta_{c_1,c_2}(x)) = (c_1x_1 + c_2x_2 - \theta)^2$. If the prior distribution of $\theta$ is $U(0, 1)$, then the risk function is

$$R(\theta, \delta_{c_1,c_2}) = E[(c_1x_1 + c_2x_2 - \theta)^2] = (c_1^2 + c_2^2) + \theta^2(c_1 + c_2 - 1)^2,$$

and the Bayes risk is

$$r_{c_1,c_2} = E(R(\theta, \delta_{c_1,c_2})) = (c_1^2 + c_2^2) + E(\theta^2)(c_1 + c_2 - 1)^2 = (c_1^2 + c_2^2) + (c_1 + c_2 - 1)^2/3.$$

The normal form of analysis requires minimisation of the Bayes risk with respect to $c_1$ and $c_2$. Equating the partial derivatives of $r_{c_1,c_2}$ with respect to $c_1$ and $c_2$ to zero, we have

$$2c_1 + \frac{2}{3}(c_1 + c_2 - 1) = 0,$$

and

$$2c_2 + \frac{2}{3}(c_1 + c_2 - 1) = 0.$$

Hence, $c_1 = c_2 = 1/5$. Thus the Bayes estimate of $\theta$ is $(x_1 + x_2)/5$.

However, if we take the prior for $\theta$ as $U(0, 1000)$, we find $c_1 = c_2 \approx 1/2$. Therefore, the Bayes estimate of $\theta$ is approximately $(x_1 + x_2)/2$, which is the sample mean.

Note that $U(0, 1000)$ represents more uncertainty about $\theta$ than $U(0, 1)$; therefore, the result is not unexpected.
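The two sets of coefficients in Example 6.3 follow from one formula. By symmetry $c_1 = c_2 = c$, and for a general prior with second moment $E(\theta^2)$ the first-order condition becomes $2c + 2E(\theta^2)(2c - 1) = 0$, so that $c = E(\theta^2)/(1 + 2E(\theta^2))$. The sketch below evaluates this for a $U(0, b)$ prior, for which $E(\theta^2) = b^2/3$; the function name is made up for illustration.

```python
from fractions import Fraction


def linear_bayes_coefficient(b):
    # Common coefficient c1 = c2 for two N(theta, 1) observations under
    # squared error loss and a U(0, b) prior, for which E(theta^2) = b^2 / 3.
    second_moment = Fraction(b * b, 3)
    return second_moment / (1 + 2 * second_moment)


c_unit = linear_bayes_coefficient(1)       # prior U(0, 1)
c_wide = linear_bayes_coefficient(1000)    # prior U(0, 1000)
print(c_unit, c_wide, float(c_wide))
```

As the prior range grows, the coefficient increases towards 1/2 without ever reaching it, which is why the sample mean appears only as a limiting case.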


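Example 6.4. (Ferguson, 1967) Suppose a random variable $X$ has the $U(0, \theta)$ density

$$f(x\mid\theta) = \begin{cases} 1/\theta & \text{if } 0 < x < \theta, \\ 0 & \text{otherwise,} \end{cases}$$

and the prior distribution for $\theta$ is

$$g(\theta) = \begin{cases} \theta e^{-\theta} & \text{if } \theta > 0, \\ 0 & \text{if } \theta \le 0. \end{cases}$$

Then the marginal distribution of $X$ has the density

$$m(x) = \int f(x\mid\theta)g(\theta)\,d\theta = \begin{cases} e^{-x} & \text{if } x > 0, \\ 0 & \text{if } x \le 0, \end{cases}$$

and the posterior distribution of $\theta$, given $X = x$, has the density

$$g(\theta\mid x) = \frac{f(x\mid\theta)g(\theta)}{m(x)} = \begin{cases} e^{-(\theta - x)} & \text{if } \theta > x, \\ 0 & \text{if } \theta < x, \end{cases}$$

where $x > 0$.

The posterior expected loss, given $X = x$, with respect to the loss function $L(\theta, a) = c(\theta - a)^2$, $c > 0$, is

$$E\{L(\theta, a)\mid X = x\} = c\,e^{x}\int_{x}^{\infty}(\theta - a)^2 e^{-\theta}\,d\theta.$$

To find the action $a$ that minimizes the posterior expected loss, we may set the derivative with respect to $a$ equal to zero and solve for $a$. Since the loss function is convex, the stationary point will provide a minimum. Thus,

$$\frac{d}{da}E\{L(\theta, a)\mid X = x\} = -2c\,e^{x}\int_{x}^{\infty}(\theta - a)e^{-\theta}\,d\theta = 0$$

gives the Bayes estimate of $\theta$ as

$$\delta^g(x) = x + 1.$$

The posterior expected loss of the above Bayes estimate is

$$E\{c(\theta - (x+1))^2\mid x\} = c\,e^{x}\int_{x}^{\infty}(\theta - (x+1))^2 e^{-\theta}\,d\theta = c\int_{0}^{\infty}(t - 1)^2 e^{-t}\,dt = c, \quad\text{for } t = \theta - x.$$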
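The conclusion of Example 6.4 can be checked numerically. Under the posterior, $t = \theta - x$ has a standard exponential density, so the posterior expected loss of an action $a$ is $c\int_0^\infty (t + x - a)^2 e^{-t}\,dt$. The sketch below evaluates this integral with a simple Simpson rule truncated at $t = 60$; the observed value `x_obs`, the truncation point and the function names are assumptions made for illustration.

```python
import math


def simpson(f, a, b, n=60000):
    # Composite Simpson rule on [a, b] with an even number n of subintervals.
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def posterior_expected_loss(a, x, c=1.0, upper=60.0):
    # c * integral of (t + x - a)^2 exp(-t) over t in (0, upper), where
    # t = theta - x; the tail beyond `upper` is negligible.
    return simpson(lambda t: c * (t + x - a) ** 2 * math.exp(-t), 0.0, upper)


x_obs = 2.5                                   # hypothetical observation
loss_at_bayes = posterior_expected_loss(x_obs + 1.0, x_obs)
loss_below = posterior_expected_loss(x_obs + 0.9, x_obs)
loss_above = posterior_expected_loss(x_obs + 1.1, x_obs)
print(loss_at_bayes, loss_below, loss_above)
```

With $c = 1$ the loss at $a = x + 1$ should be close to 1, and moving the action to either side should increase it.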

SOME STANDARD LOSS FUNCTIONS 

6.2 SQUARED ERROR LOSS FUNCTION (SELF) 


The squared error loss function (SELF) was proposed by Legendre (1805) and Gauss (1810) to develop least squares theory. Later, it was used in estimation problems when unbiased estimators of $\theta$ were evaluated in terms of the risk function $R(\theta, \delta)$, which becomes nothing but the variance of the estimator. It was also observed that SELF is a convex loss function and, therefore, restricts the class of estimators by excluding randomized estimators.

If we write $L(\theta, a) = h(\theta - a)$ and expand in a Taylor's series about zero, we have

$$h(\theta - a) = h(0) + (\theta - a)h'(0) + \frac{(\theta - a)^2}{2}h''(0),$$

after neglecting higher order terms of $(\theta - a)$, which will be negligibly small when $(\theta - a)$ is small. Writing

$$c_0 = h(0), \quad c_1 = h'(0), \quad c_2 = h''(0)/2,$$

we have

$$L(\theta, a) = c_0 + c_1(\theta - a) + c_2(\theta - a)^2 = c_2\left(\theta - a + \frac{c_1}{2c_2}\right)^2 + c_0 - \frac{c_1^2}{4c_2}.$$

If $c_2$ is a positive constant, this loss is equivalent to the linearly transformed loss

$$L(\theta, a) = \left(\theta - a + \frac{c_1}{2c_2}\right)^2.$$

This is of the squared error form except for the constant $c_1/2c_2$.

Remark 6.12. Note that the decision problem with $L(\theta, a) = (\theta - a + c)^2$ requires consideration of a new action space $\mathcal{A}^* = \{a - c : a \in \mathcal{A}\}$. For $a^* \in \mathcal{A}^*$, the loss corresponding to $L$ is $L^*(\theta, a^*) = (\theta - a^*)^2$. If $\delta^*$ is an optimal decision rule in the transformed problem, then $\delta = \delta^* + c$ will be optimal in the original problem.

Example 6.5. Suppose

$$L(\theta, a) = 1 - \exp\left(-(\theta - a)^2/2\right) \approx \frac{1}{2}(\theta - a)^2.$$

Remark 6.13. Since $L(\theta, a)$ is of the form $h(\theta - a)$, it can be approximated by a SELF.
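The quality of the quadratic approximation in Example 6.5 depends on how large $|\theta - a|$ is. The following sketch, with illustrative function names, compares the bounded loss with its SELF approximation for a small and a large error.

```python
import math


def bounded_loss(theta, a):
    # L(theta, a) = 1 - exp(-(theta - a)^2 / 2), which lies in [0, 1].
    return 1.0 - math.exp(-((theta - a) ** 2) / 2.0)


def self_approximation(theta, a):
    # Leading term of the Taylor expansion about theta - a = 0.
    return 0.5 * (theta - a) ** 2


for error in (0.1, 1.0, 10.0):
    print(error, bounded_loss(error, 0.0), self_approximation(error, 0.0))
```

For small errors the two losses nearly coincide, while for large errors the bounded loss levels off at 1 and the quadratic term keeps growing.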

Definition 6.9. A loss function satisfying the following conditions:

(i) $L(\theta, a) = L(\theta - a)$ is a function of $(\theta - a)$ only,

(ii) $L(\theta, a)$ is symmetric in $(\theta - a)$,

(iii) $L(\theta, a)$ is bounded above by one and below by zero,

(iv) $L(\theta, a)$ is an increasing function of $|\theta - a|$,

is called an estimation loss function.

Example 6.5 provides an estimation loss function.

Remark 6.14. The difficulty with unbounded loss functions, like SELF, is that Bayes estimates may change enormously when the observation of the random variable changes infinitesimally. Therefore, the investigator has to be absolutely precise about his probability statements. Furthermore, in real life situations, it will usually be impossible to lose an infinite amount of money. Thus, we may need bounded loss functions such as the one given in Example 6.5.




Result 6.3. The extensive form of analysis provides the Bayes estimate of $\theta$ under SELF as $E(\theta\mid x)$.

Proof. For $L(\theta, a) = (\theta - a)^2$, we minimize $r(g, a) = E[(\theta - a)^2\mid x]$ with respect to $a$. Differentiating $r(g, a)$ with respect to $a$ and equating it to zero, we have $a = E(\theta\mid x)$.

In order to show that $E(\theta\mid x)$ is a minimum of $r(g, a)$, we note that

$$\frac{d^2}{da^2}r(g, a) = \frac{d^2}{da^2}E((\theta - a)^2\mid x) = 2 > 0.$$

Hence the decision rule $a\,(= \delta(x))$ is a Bayes estimate for $\theta$ under SELF.
Remark 6.15. The squared error loss function is not the only loss function for which the posterior mean is the Bayes estimate. We shall see later that for the natural exponential family

$$f(x\mid\theta) = c(\theta)h(x)\exp(\theta x),$$

the Bayes estimate under the entropy loss function is the posterior mean (see Robert (2001), page 82).

Remark 6.16. In case $\theta$ is a k-vector $(\theta_1, \theta_2, \ldots, \theta_k)$ to be estimated by $a = (a_1, a_2, \ldots, a_k)$, then $L(\theta, a) = (\theta - a)'B(\theta - a)$ is called a quadratic loss, where $B$ is a $k \times k$ positive definite symmetric matrix. In particular, when $B$ is a diagonal matrix,

$$L(\theta, a) = \sum_{i=1}^{k} b_i(\theta_i - a_i)^2, \tag{6.8}$$

which is the natural extension of SELF to the multiparameter situation.
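Result 6.4. The Bayes estimate associated with the weighted SELF

$$L_{\omega}(\theta, a) = \omega(\theta)(\theta - a)^2, \tag{6.9}$$

$\omega(\theta) > 0$ for all $\theta \in \Theta$, is

$$\delta^g(x) = \frac{E(\theta\,\omega(\theta)\mid x)}{E(\omega(\theta)\mid x)}, \quad\text{provided the expectations exist.} \tag{6.10}$$

Proof. Since $r(g, a) = E(\omega(\theta)(\theta - a)^2\mid x)$, differentiating $r(g, a)$ with respect to $a$ and equating it to zero, we have

$$a = \frac{E(\theta\,\omega(\theta)\mid x)}{E(\omega(\theta)\mid x)}.$$

In order to show that $E(\theta\,\omega(\theta)\mid x)/E(\omega(\theta)\mid x)$ is a minimum of $r(g, a)$, we note that

$$\frac{d^2}{da^2}r(g, a) = 2E(\omega(\theta)\mid x) > 0.$$

Hence the decision rule $\delta(x)\,(= a)$ is a Bayes estimate of $\theta$.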
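Results 6.3 and 6.4 can be illustrated on a small discrete posterior. In the sketch below the three support points and their posterior probabilities are invented for illustration; the function computes the ratio in (6.10) and, with a constant weight, reduces to the posterior mean of Result 6.3.

```python
from fractions import Fraction

# Hypothetical discrete posterior: support points and their probabilities.
SUPPORT = [Fraction(1, 5), Fraction(1, 2), Fraction(4, 5)]
PROBS = [Fraction(1, 5), Fraction(1, 2), Fraction(3, 10)]


def weighted_bayes_estimate(weight):
    # E(theta * w(theta) | x) / E(w(theta) | x), as in (6.10).
    numerator = sum(p * th * weight(th) for th, p in zip(SUPPORT, PROBS))
    denominator = sum(p * weight(th) for th, p in zip(SUPPORT, PROBS))
    return numerator / denominator


def expected_weighted_loss(a, weight):
    # Posterior expectation of w(theta) * (theta - a)^2.
    return sum(p * weight(th) * (th - a) ** 2 for th, p in zip(SUPPORT, PROBS))


def inverse_square(th):
    # Weight w(theta) = theta^(-2).
    return 1 / th ** 2


posterior_mean = weighted_bayes_estimate(lambda th: 1)   # SELF, Result 6.3
weighted_estimate = weighted_bayes_estimate(inverse_square)
print(posterior_mean, weighted_estimate)
```

The weight $\theta^{-2}$ penalises errors more heavily when $\theta$ is small, so the weighted estimate is pulled below the posterior mean.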


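Remark 6.17. The Bayes estimate under the weighted SELF may not exist if the weight function $\omega(\theta)$ increases too fast to infinity.

Example 6.6. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from $U(0, \theta)$, the prior distribution is $g(\theta) = 1$, $\theta \in [0, 1]$, and the loss function is

$$L(\theta, a) = \frac{1}{\theta^2}(a - \theta)^2.$$

The posterior distribution of $\theta$, given $x = (x_1, x_2, \ldots, x_n)$, is

$$g(\theta\mid x) = \frac{(n-1)\,\theta^{-n}}{x_{(n)}^{-(n-1)} - 1}, \quad x_{(n)} \le \theta \le 1,$$

where $x_{(n)} = \max(x_1, x_2, \ldots, x_n)$. Since the weight function is $\omega(\theta) = 1/\theta^2$, the Bayes estimate of $\theta$ is

$$\frac{E(\theta^{-1}\mid x)}{E(\theta^{-2}\mid x)} = \frac{\int_{x_{(n)}}^{1}\theta^{-(n+1)}\,d\theta}{\int_{x_{(n)}}^{1}\theta^{-(n+2)}\,d\theta} = \frac{n+1}{n}\left(\frac{1 - x_{(n)}^{\,n}}{1 - x_{(n)}^{\,n+1}}\right)x_{(n)}.$$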
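Example 6.7. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from the Pareto density

$$f(x\mid\theta) = \frac{k\,\theta^{k}}{x^{k+1}}\,I_{(\theta,\infty)}(x); \quad \theta, k > 0, \ k \text{ known.}$$

Since $X_{(1)} = \min(X_1, X_2, \ldots, X_n)$ is the sufficient statistic for $\theta$, the conjugate prior for $\theta$, as in Result 4.12, is

$$g(\theta) = \frac{\beta}{m^{\beta}}\,\theta^{\beta-1}\,I_{(0,m)}(\theta),$$

and the posterior pdf for $\theta$ is

$$g(\theta\mid x) = \frac{n+\beta}{m_1^{\,n+\beta}}\,\theta^{n+\beta-1}\,I_{(0,m_1)}(\theta); \quad m_1 = \min(x_{(1)}, m).$$

The Bayes estimate of $\theta$ under the weighted SELF

$$L(\theta, a) = \frac{1}{\theta^2}(a - \theta)^2$$

is

$$\frac{E(\theta^{-1}\mid x)}{E(\theta^{-2}\mid x)} = \frac{m_1(n+\beta-2)}{n+\beta-1}.$$

Result 6.5. If $L(\theta, a) = (h(\theta) - a)^2$, then the Bayes estimate of $h(\theta)$ is $E[h(\theta)\mid x]$ and, more generally, if $L(\theta, a) = \omega(\theta)(h(\theta) - a)^2$, then the Bayes estimate of $h(\theta)$ is

$$\frac{E(\omega(\theta)h(\theta)\mid x)}{E(\omega(\theta)\mid x)},$$

provided the expectations exist.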

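Example 6.8. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from the Poisson distribution with unknown mean $\theta$ and the prior distribution of $\theta$ is known to be Gamma$(\alpha, \beta)$. Under the SELF, the Bayes estimate of $h(\theta) = \theta^{r}$, $r \ge 1$, is

$$E(\theta^{r}\mid x) = \frac{1}{(\beta+n)^{r}}\,\frac{\Gamma(\alpha + r + \sum x_i)}{\Gamma(\alpha + \sum x_i)},$$

provided $\alpha + r + \sum x_i > 0$ and $\beta + n > 0$. In particular, if we let $\alpha, \beta \to 0$ and $t = \sum x_i$, the Bayes estimate of $\theta^{r}$ is

$$E(\theta^{r}\mid x) = \begin{cases} \dfrac{1}{n^{r}}\,t(t+1)\cdots(t+r-1) & \text{if } t + r > 1, \\[1.5ex] 0 & \text{otherwise.} \end{cases}$$

In particular, for $r = 1$, $E(\theta\mid x) = t/n$, which is an unbiased estimator of $\theta$. However, for $r = 2$,

$$E(\theta^{2}\mid x) = \frac{t(t+1)}{n^{2}}$$

is not an unbiased estimator of $\theta^{2}$.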

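It may be interesting to compare the Bayes estimate with the frequentist UMVUE.

Example 6.9. (Example 6.8 continued) The Bayes estimate of $P(X = r) = e^{-\theta}\theta^{r}/r!$ under SELF is

$$E\left(\frac{e^{-\theta}\theta^{r}}{r!}\,\Big|\,x\right) = \frac{(\beta+n)^{\alpha+t}\,\Gamma(\alpha+t+r)}{\Gamma(\alpha+t)\,r!\,(\beta+n+1)^{\alpha+t+r}}.$$

In particular, for a non-informative prior $g(\theta) \propto 1/\theta$, that is, $(\alpha, \beta) \to 0$, we have

$$E\left(\frac{e^{-\theta}\theta^{r}}{r!}\,\Big|\,x\right) = \binom{t+r-1}{r}\left(\frac{n}{n+1}\right)^{t}\left(\frac{1}{n+1}\right)^{r},$$

which is the pmf of the NBin$\left(t, \frac{1}{n+1}\right)$ distribution. It is interesting to recall that the UMVU estimator for $P(X = r)$ is the pmf of Bin$\left(t, \frac{1}{n}\right)$.

Example 6.10. Suppose $X \sim$ Gamma$(k, \theta)$ and $\theta \sim$ Gamma$(\alpha, \beta)$. Suppose we are interested in estimating $h(\theta) = 1/\theta$ under SELF, i.e., $L(\theta, a) = \left(\frac{1}{\theta} - a\right)^2$. The posterior distribution of $\theta$ is Gamma$(\alpha + k, \beta + x)$ and, therefore, the Bayes estimate of $h(\theta) = 1/\theta$ is

$$E\left(\frac{1}{\theta}\,\Big|\,x\right) = \frac{\beta + x}{\alpha + k - 1}, \quad\text{provided } \alpha + k > 1.$$

However, if we consider a scale invariant loss function

$$L_1(\theta, a) = \theta^{2}\left(a - \frac{1}{\theta}\right)^{2},$$

then the Bayes estimate of $h(\theta) = 1/\theta$ is

$$\frac{E(\theta\mid x)}{E(\theta^{2}\mid x)} = \frac{\beta + x}{\alpha + k + 1}.$$

We see that the scale invariant loss function may be more relevant for the estimation of the reciprocal of the parameter, i.e., $1/\theta$.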
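Both estimates in Example 6.10 follow from the moments of a gamma distribution. The sketch below assumes the shape and rate parametrisation implied by the posterior Gamma$(\alpha + k, \beta + x)$, for which $E(\theta^{s}) = \Gamma(\text{shape} + s)/(\Gamma(\text{shape})\,\text{rate}^{s})$; the numerical values of $\alpha$, $\beta$, $k$ and $x$ are invented for illustration.

```python
import math


def gamma_moment(shape, rate, s):
    # E(theta^s) for theta ~ Gamma(shape, rate); requires shape + s > 0.
    return math.gamma(shape + s) / (math.gamma(shape) * rate ** s)


# Hypothetical hyperparameters and observation.
alpha, beta, k, x = 2.0, 1.0, 3.0, 4.0
shape, rate = alpha + k, beta + x            # posterior Gamma(alpha + k, beta + x)

# Bayes estimate of 1/theta under SELF: the posterior mean of 1/theta.
estimate_self = gamma_moment(shape, rate, -1)

# Bayes estimate under the weighted loss theta^2 (a - 1/theta)^2, obtained
# from Result 6.5 with w(theta) = theta^2 and h(theta) = 1/theta.
estimate_invariant = gamma_moment(shape, rate, 1) / gamma_moment(shape, rate, 2)

print(estimate_self, (beta + x) / (alpha + k - 1))
print(estimate_invariant, (beta + x) / (alpha + k + 1))
```

Each printed pair should agree, the second value in every line being the closed form quoted in the example.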


Duality between Prior and Loss

It is interesting to note that if we consider a modified prior distribution

$$g_{\omega}(\theta) = \frac{g(\theta)\,\omega(\theta)}{\int_{\Theta} g(\theta)\,\omega(\theta)\,d\theta}, \quad \omega(\theta) > 0,$$

then the Bayes estimate associated with the prior distribution $g_{\omega}(\theta)$, under SELF, is equal to $E(\theta\mid x)$, where the expectation is taken with respect to $g_{\omega}(\theta\mid x)$. It is obvious since the Bayes estimate $a^g$, with respect to $g$ under the loss $L_{\omega}(\theta, a)$, is obtained by minimising

$$\int L_{\omega}(\theta, a)f(x\mid\theta)g(\theta)\,d\theta = \int \omega(\theta)(\theta - a)^2 f(x\mid\theta)g(\theta)\,d\theta \propto \int (\theta - a)^2 f(x\mid\theta)g_{\omega}(\theta)\,d\theta.$$

Hence, from Result 6.3, the Bayes estimate of $\theta$ with respect to the modified prior $g_{\omega}(\theta) \propto \omega(\theta)g(\theta)$ is the mean of the posterior distribution $g_{\omega}(\theta\mid x) \propto \omega(\theta)g(\theta\mid x)$.

Remark 6.18. (Raiffa and Schlaifer, 1961) Suppose $L(\theta, a)$ is any simple loss function and $L_{\omega}(\theta, a)$ is a family of weighted loss functions such that $L_{\omega}(\theta, a) = L(\theta, a)\,\omega(\theta)$, where $\omega$ is some non-negative function of $\theta$. If $g$ denotes the conjugate density of $\theta$ with hyperparameter $\alpha$, then the prior expected loss of $a$ is $\int L_{\omega}(\theta, a)g(\theta\mid\alpha)\,d\theta$. However, if $\omega$ is such that

$$\omega(\theta)g(\theta\mid\alpha) = g(\theta\mid\alpha^{*}), \tag{6.11}$$

then the prior expected loss is $\int L(\theta, a)g(\theta\mid\alpha^{*})\,d\theta$. Thus any analytical results obtained from (6.11) for the simple loss function $L$ apply unchanged for the modified loss function $L_{\omega}$.
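Example 6.11. Suppose an observation $x$ is obtained from the $N(\theta, 1)$ density and the prior for $\theta$ is $N(0, 1)$; then the posterior density of $\theta$ is $N\left(\frac{x}{2}, \frac{1}{2}\right)$. The Bayes estimate of $\theta$ under the weighted SELF, $L_{\omega}(\theta, a) = \theta(\theta - a)^2$, is equal to

$$\frac{E(\theta^{2}\mid x)}{E(\theta\mid x)} = \frac{(x^{2} + 2)/4}{x/2} = \frac{x^{2} + 2}{2x}.$$

If we, however, consider estimation of $\theta$ with prior $g_{\omega}(\theta) \propto \theta e^{-\theta^{2}/2}$ under SELF, $L(\theta, a) = (\theta - a)^2$, then the Bayes estimate of $\theta$ is the mean of the posterior distribution

$$g_{\omega}(\theta\mid x) = \frac{2}{x\sqrt{\pi}}\,\theta\exp\left(-(\theta - x/2)^{2}\right).$$

Thus, the Bayes estimate of $\theta$ with respect to the prior $g_{\omega}$ under SELF is

$$E(\theta\mid x) = \int \theta\,g_{\omega}(\theta\mid x)\,d\theta = \frac{2}{x\sqrt{\pi}}\int_{-\infty}^{\infty}\theta^{2}e^{-(\theta - x/2)^{2}}\,d\theta = \left(\frac{1}{2} + \frac{x^{2}}{4}\right)\frac{2}{x} = \frac{x^{2} + 2}{2x}.$$

Remark 6.19. The above example illustrates that using the weighted loss function may be mathematically more convenient for obtaining the Bayes estimate than using the weighted prior distribution under the more convenient SELF. In some examples, it may be the other way round.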
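The two routes of Example 6.11 can be compared numerically: the ratio $E(\theta^2\mid x)/E(\theta\mid x)$ under the $N(x/2, 1/2)$ posterior, and the mean of the density proportional to $\theta\exp(-(\theta - x/2)^2)$. The sketch below uses a plain Simpson rule over a wide finite interval; the function names, the interval half-width and the test value of $x$ are assumptions made for illustration.

```python
import math


def simpson(f, a, b, n=20000):
    # Composite Simpson rule on [a, b] with an even number n of subintervals.
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def weighted_loss_route(x):
    # E(theta^2 | x) / E(theta | x) for the N(x/2, 1/2) posterior.
    return (x ** 2 / 4.0 + 0.5) / (x / 2.0)


def weighted_prior_route(x, half_width=10.0):
    # Mean of the density proportional to theta * exp(-(theta - x/2)^2).
    centre = x / 2.0
    kernel = lambda th: th * math.exp(-((th - centre) ** 2))
    lo, hi = centre - half_width, centre + half_width
    return simpson(lambda th: th * kernel(th), lo, hi) / simpson(kernel, lo, hi)


x_obs = 2.0                                   # hypothetical observation
print(weighted_loss_route(x_obs), weighted_prior_route(x_obs))
```

Both values should agree with $(x^2 + 2)/(2x)$ evaluated at the chosen $x$.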


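Example 6.12. Let $X \sim$ Bernoulli$(\theta)$ and

$$L_{\omega}(\theta, a) = \left(\frac{\theta - a}{\theta}\right)^{2} = \theta^{-2}(\theta - a)^{2},$$

so that $\omega(\theta) = \theta^{-2}$. If we assume that $\theta$ has a prior distribution Beta$(r, n)$ with density $\theta^{r-1}(1-\theta)^{n-r-1}/B(r, n-r)$, then the prior expected loss under $L_{\omega}$ is

$$\frac{1}{B(r, n-r)}\int_{0}^{1}\theta^{-2}(\theta - a)^{2}\,\theta^{r-1}(1-\theta)^{n-r-1}\,d\theta = \frac{(n-1)(n-2)}{(r-1)(r-2)}\,\frac{1}{B(r-2, n-r)}\int_{0}^{1}(\theta - a)^{2}\,\theta^{r-3}(1-\theta)^{n-r-1}\,d\theta.$$

The problem is then identical to a problem with simple SELF, and the hyperparameters of the prior are modified accordingly, but the algebraic form of the prior density of $\theta$ is not changed.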


Example 6.13. Suppose $X_1, X_2, \ldots, X_n$ is a sample from the Bernoulli distribution with probability of success $\theta$ having a $U(0, 1)$ prior. Consider the loss function $\theta^{-r}(\theta - a)^{2}$. Using the duality between prior and loss functions, we may consider the prior $g(\theta) \propto \theta^{-r}$, $r > 0$, and the loss function SELF. The posterior distribution of $\theta$ is Beta$(1 + \sum x_i - r, \ 1 + n - \sum x_i)$. Therefore, the Bayes estimate of $\theta$ is $(\sum x_i - r + 1)/(n - r + 2)$, provided $n + 2 > r$. Thus, the Bayes estimate of $\theta$ will exist only if $r < n + 2$. Since the weight function $\omega(\theta) = \theta^{-r}$ tends to $\infty$ as $\theta \to 0$, the Bayes estimate will not exist if $n + 2 \le r$.

Remark 6.20. C.P. Robert (2001) refers to this possibility of shifting the weight function from the loss to the prior, and vice versa, as the duality between loss and prior functions.

Remark 6.21. According to Jaynes (2003, page 420), prior probabilities are usually far more objective than loss functions, both in the mathematical theory and in everyday decision problems in real life. In the mathematical theory, we have general formal principles (for example, maximum entropy, transformation groups, and marginalisation) that remove the arbitrariness of prior probabilities for a large class of problems. But we have no such principles for determining loss/utility functions. This suggests that we should avoid using complicated loss functions unless there is a reason to do so.


Example 6.14. (Berger, 1985) Suppose $X \sim N(\theta, 100)$ and $\theta \sim N(100, 225)$ and

$$L(\theta, a) = (\theta - a)^2 \exp\left[\left(\frac{\theta - 100}{30}\right)^2\right].$$

If X = 115 is observed, the posterior distribution of $\theta$ is N(110.39, 69.23). Using the duality between the loss and the prior, let us reframe the problem by defining the prior

$$g(\theta) \propto \exp\left[\left(\frac{\theta - 100}{30}\right)^2\right] \exp\left[-\frac{1}{2}\left(\frac{\theta - 100}{15}\right)^2\right],$$

so that the prior distribution $g(\theta)$ becomes, after simplification, N(100, 450). So, the posterior distribution of $\theta$ is N(112.28, 81.81). Hence, the Bayes estimate of $\theta$ under SELF is the posterior mean 112.28 and it is also the Bayes estimate of $\theta$ for the original problem.
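The two posterior distributions quoted in Example 6.14 follow from the usual normal-normal update. A minimal sketch, assuming variances (not precisions) as the second argument of N(.,.) and using an illustrative helper name:

```python
def normal_posterior(x, like_var, prior_mean, prior_var):
    """Posterior mean and variance for X ~ N(theta, like_var), theta ~ N(prior_mean, prior_var)."""
    post_var = 1.0 / (1.0 / like_var + 1.0 / prior_var)
    post_mean = post_var * (x / like_var + prior_mean / prior_var)
    return post_mean, post_var

# Original prior N(100, 225) and the reweighted prior N(100, 450), with x = 115.
original = normal_posterior(115.0, 100.0, 100.0, 225.0)
reweighted = normal_posterior(115.0, 100.0, 100.0, 450.0)
```

Up to rounding in the second decimal place, `original` reproduces N(110.39, 69.23) and `reweighted` reproduces N(112.28, 81.81).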


Modifications of SELF 


An important question is “How should one choose the weight function?”. Whittle and Lane (1967) suggested that the weight function $w(\theta)$ should be Fisher’s information $I(\theta)$, since the Bayes estimate may then be viewed as the classical minimum variance unbiased estimator.

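Example 6.15. Suppose X ~ Bernoulli($\theta$); then the Fisher’s information for $\theta$ is $I(\theta) = 1/[\theta(1-\theta)]$. Thus, Whittle and Lane’s suggestion gives the weighted SELF as $(\theta - a)^2/[\theta(1-\theta)]$.

Example 6.16. Suppose $X \sim N(0, \sigma^2)$. The Fisher’s information for $\sigma^2$, $I(\sigma^2) = 1/(2\sigma^4)$, gives the weighted SELF as

$$\frac{(\sigma^2 - a)^2}{2\sigma^4} = \frac{1}{2}\left(\frac{\sigma^2 - a}{\sigma^2}\right)^2 = \frac{1}{2}\left(1 - \frac{a}{\sigma^2}\right)^2.$$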
Example 6.17. For X ~ Gamma(k, $\theta$), k known, the Fisher’s information is $I(\theta) = k\theta^{-2}$. The loss

$$L_w(\theta, a) = \left(\frac{1}{\theta} - \frac{1}{a}\right)^2$$

may be considered as weighted SELF for the transformed parameter $1/\theta$. On using Result 6.5, the Bayes estimate of $1/\theta$ will be $E(\theta^{-1} \mid x)$ when the prior is modified to Gamma($\alpha - 2$, $\beta$).
Shapiro and Wardrop (1980) extended this loss function to a larger family of weighted SELF as

$$L^*(\theta, a) = \exp\left(\theta r - B(\theta)s\right) I(\theta)(\theta - a)^2, \tag{6.12}$$

where r and s are parameters that allow a greater variety of shapes for the loss functions. The exponential term on the RHS is the kernel of the conjugate prior for the parameter $\theta$ when the sampling distribution belongs to the one-parameter exponential family of distributions. It is interesting to note that Shapiro and Wardrop’s generalization of the weighted SELF is a combination of ideas given by Raiffa and Schlaifer (1961) and that of Whittle and Lane (1967).

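Makov (1994) gives an interesting comparison of Bayes estimates of the Poisson mean $\theta$ when the prior distribution on $\theta$ is Gamma($\alpha$, $\beta$) under five loss functions: $L_1(\theta, a) = (\theta - a)^2$, $L_2(\theta, a) = (\theta - a)^2/\theta$, $L_3(\theta, a) = (\theta - a)^2/\theta^2$, $L_4(\theta, a) = (\log a - \log\theta)^2$, and $L_5(\theta, a) = (\sqrt{\theta} - \sqrt{a})^2$. The Bayes estimates of $\theta$, when $\alpha = \beta = 0$, are $x$, $x - 1$, $x - 2$, $\exp(\Psi(x))$, and $\left(\frac{\Gamma(x + 0.5)}{\Gamma(x)}\right)^2$, respectively, where $\Psi(x) = \frac{d}{dx}\log\Gamma(x)$.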
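The five estimates in Makov's comparison can be reproduced from moments of the Gamma(x, 1) posterior that arises when $\alpha = \beta = 0$. The sketch below is illustrative only: the function names are hypothetical, and the digamma function is approximated by a central difference of `math.lgamma` because the Python standard library has no digamma.

```python
import math

def digamma(z, h=1e-5):
    """Central-difference approximation to d/dz log Gamma(z)."""
    return (math.lgamma(z + h) - math.lgamma(z - h)) / (2.0 * h)

def gamma_moment(shape, s):
    """E(theta^s) for theta ~ Gamma(shape, rate 1); requires shape + s > 0."""
    return math.exp(math.lgamma(shape + s) - math.lgamma(shape))

def makov_estimates(x):
    """Bayes estimates of the Poisson mean for the Gamma(x, 1) posterior, x > 2."""
    return {
        "L1": gamma_moment(x, 1),                          # posterior mean
        "L2": 1.0 / gamma_moment(x, -1),                   # 1 / E(1/theta)
        "L3": gamma_moment(x, -1) / gamma_moment(x, -2),   # E(1/theta) / E(1/theta^2)
        "L4": math.exp(digamma(x)),                        # exp(E(log theta))
        "L5": gamma_moment(x, 0.5) ** 2,                   # (E(sqrt(theta)))^2
    }
```

For x = 5 the first three entries evaluate to 5, 4 and 3, in line with x, x - 1 and x - 2, while the last two lie between 4 and 5.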


Squared Logarithmic Loss Function 


If $\theta$ is a scale parameter and the loss function is a function of $a/\theta$, i.e., $L(\theta, a) = L(a/\theta)$, then making the transformation $\phi = \log\theta$ and $\tilde{a} = \log a$, we have

$$L_1(\phi, \tilde{a}) = L(e^{\phi}, e^{\tilde{a}}) = L(e^{\tilde{a} - \phi}). \tag{6.13}$$

Thus, if we define $L_2(x) = L(e^x)$ then $L_1(\phi, \tilde{a}) = L_2(\tilde{a} - \phi)$. Therefore, in the case of a scale parameter, we should consider the loss to be squared error in $\log\theta$, instead of the simple squared error loss $L(\theta, a) = (\theta - a)^2$.


Example 6.18. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample of size n from $N(0, \sigma^2)$, and the prior distribution of $\sigma^2$ is Jeffreys’ non-informative prior $g(\sigma^2) \propto 1/\sigma^2$, so that $g(\sigma^2 \mid x)$ is an Inverted-Gamma$\left(\frac{1}{2}n, \frac{1}{2}\sum x_i^2\right)$. If we consider the loss to be squared error in $\log\sigma^2$, i.e.,

$$L(\sigma^2, a) = (\log a - \log\sigma^2)^2, \tag{6.14}$$

the Bayes estimate of $\sigma^2$ is $a = \exp\left(E(\log\sigma^2)\right)$, where the expectation is taken with respect to $g(\sigma^2 \mid x)$.

Since, on substituting $z = S/\sigma^2$,

$$E(\log\sigma^2 \mid x) = \int_0^\infty \log\sigma^2\, \frac{S^{n/2}}{\Gamma(n/2)}\, (\sigma^2)^{-(n/2+1)} e^{-S/\sigma^2}\, d\sigma^2$$
$$= \int_0^\infty (\log S - \log z)\, \frac{z^{n/2-1} e^{-z}}{\Gamma(n/2)}\, dz$$
$$= \log S - \Psi\left(\frac{n}{2}\right),$$

where $\Psi(\alpha) = \frac{d}{d\alpha}\log\Gamma(\alpha)$, $2S = \sum_{i=1}^n x_i^2$ and $2\exp\left(\Psi\left(\frac{n}{2}\right)\right) \approx n - 1$ (Jahnke and Emde, 1945). For large n, we have $a \approx \sum x_i^2/(n-1)$. If we had used SELF to estimate $\sigma^2$, the Bayes estimate of $\sigma^2$ would have been $E(\sigma^2 \mid x) = \sum x_i^2/(n-2)$.


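On the other hand, if we had used the invariant loss function $L(\sigma^2, a) = \left(\frac{a}{\sigma^2} - 1\right)^2$, the Bayes estimate of $\sigma^2$ will be

$$\frac{E(\sigma^{-2} \mid x)}{E(\sigma^{-4} \mid x)} = \frac{n/\sum x_i^2}{n(n+2)/\left(\sum x_i^2\right)^2} = \frac{\sum x_i^2}{n+2}.$$

Example 6.19. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from Pois($\theta$) and that the prior distribution of $\theta$ is Gamma($\alpha$, $\beta$). If the loss function is

$$L(\theta, a) = (\log a - \log\theta)^2 = \left(\log\frac{a}{\theta}\right)^2, \tag{6.15}$$

the Bayes estimate of $\theta$ is given by $a = \exp\left[E(\log\theta \mid x)\right]$, provided the posterior expectation $E(\log\theta \mid x)$ exists. Since

$$\int_0^\infty \log\theta \cdot \theta^{\alpha-1} e^{-\beta\theta}\, d\theta = \frac{\Gamma(\alpha)}{\beta^{\alpha}}\left(\Psi(\alpha) - \log\beta\right),$$

where $\Psi(x) = \frac{d}{dx}\log\Gamma(x)$, we have

$$a = \exp\left[\Psi\left(\alpha + \sum x_i\right) - \log(\beta + n)\right].$$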
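The estimate of Example 6.19 is easy to evaluate once a digamma routine is available. The following is a minimal sketch, assuming a central-difference approximation to the digamma function (the standard library does not provide one) and a rate parameterisation of the Gamma prior; the function names and the sample data in the usage note are hypothetical.

```python
import math

def digamma(z, h=1e-5):
    """Central-difference approximation to d/dz log Gamma(z)."""
    return (math.lgamma(z + h) - math.lgamma(z - h)) / (2.0 * h)

def log_loss_estimate(xs, alpha, beta):
    """exp(Psi(alpha + sum x_i) - log(beta + n)) for a Poisson sample xs."""
    shape = alpha + sum(xs)
    rate = beta + len(xs)
    return math.exp(digamma(shape) - math.log(rate))

def self_estimate(xs, alpha, beta):
    """Posterior mean (alpha + sum x_i) / (beta + n), the Bayes estimate under SELF."""
    return (alpha + sum(xs)) / (beta + len(xs))
```

With the made-up sample [2, 3, 1, 4] and $\alpha = 2$, $\beta = 1$, the squared logarithmic loss gives an estimate slightly below the posterior mean 2.4, since $\exp(\Psi(z)) < z$.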


Caution 


Result 6.6. Suppose $t(x_1, x_2, \ldots, x_n)$ is a Bayes estimate of $h(\theta)$ with respect to the prior distribution $g(\theta)$ under SELF. If the estimator t and $h(\theta)$ have finite variances then either $\mathrm{Var}(t \mid \theta) = 0$ or t is not an unbiased estimator of $h(\theta)$.

Proof. Suppose $t(x_1, x_2, \ldots, x_n)$ is an unbiased estimator of $h(\theta)$, i.e., $E(t(x_1, x_2, \ldots, x_n) \mid \theta) = h(\theta)$. Since t is a Bayes estimator of $h(\theta)$ under SELF, $t = E(h(\theta) \mid x_1, x_2, \ldots, x_n)$. Now,

$$\mathrm{Var}(t) = E(\mathrm{Var}(t \mid \theta)) + \mathrm{Var}(E(t \mid \theta)) = E(\mathrm{Var}(t \mid \theta)) + \mathrm{Var}(h(\theta))$$

and

$$\mathrm{Var}(h(\theta)) = E[\mathrm{Var}(h(\theta) \mid x_1, \ldots, x_n)] + \mathrm{Var}[E(h(\theta) \mid x_1, \ldots, x_n)] = E[\mathrm{Var}(h(\theta) \mid x_1, \ldots, x_n)] + \mathrm{Var}(t).$$

Hence

$$E(\mathrm{Var}(t \mid \theta)) + E[\mathrm{Var}(h(\theta) \mid x_1, \ldots, x_n)] = 0.$$

Since both the terms on the left hand side are non-negative and their sum is zero, we have

$$E(\mathrm{Var}(t \mid \theta)) = 0,$$

and since $\mathrm{Var}(t \mid \theta)$ is non-negative and has expectation zero, we have $\mathrm{Var}(t \mid \theta) = 0$. Thus either t estimates $h(\theta)$ with probability one or t is not an unbiased estimator of $h(\theta)$.
Example 6.20. Suppose X ~ Bin(n, $\theta$). Consider the maximum likelihood estimator (which is also the unbiased estimator) of $\theta$, $\delta(X) = X/n$. Since $\mathrm{Var}\left(\frac{X}{n} \mid \theta\right) = \frac{\theta(1-\theta)}{n}$ is zero if, and only if, $\theta = 0$ or 1, X/n can be a Bayes estimate of $\theta$ only for a prior $g(\theta)$ which assigns probability 1 to the set {0, 1}. With this prior, the Bayes estimate is $\delta(X) = 0$ if X = 0 is observed, and $\delta(X) = 1$ if X = n is observed. Obviously, with such a prior, we should not expect to observe X = 1, 2, ..., n - 1. This example suggests that the classical estimator X/n of $\theta$ can be Bayes only under very special circumstances.


Example 6.21. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from $N(\theta, \sigma^2)$. We know that $E(\bar{X}) = \theta$, hence $\bar{X}$ is an unbiased estimator of $\theta$. Since $E(\bar{X} - \theta)^2 = \sigma^2/n$, which is independent of $\theta$, for any (proper) prior distribution $g(\theta)$ of $\theta$

$$E\left[E\left((\bar{X} - \theta)^2 \mid \theta\right)\right] = E(\bar{X} - \theta)^2 = \frac{\sigma^2}{n} \neq 0.$$

Hence $\bar{X}$ is not a Bayes estimator of $\theta$ for any (proper) prior distribution of $\theta$.

Example 6.22. Suppose X has a $N(\theta, 1)$ distribution, $\theta$ unknown. The maximum likelihood estimator of $\theta$ is $\delta(X) = X$. It is also an unbiased estimator of $\theta$. Consider the loss function to be SELF; then Result 6.6 suggests that $\delta(X) = X$ cannot be a Bayes estimator of $\theta$ for any proper prior distribution. We note that

$$E(\theta\delta(X)) = E(E(\theta\delta(X) \mid X)) = E(\delta(X)E(\theta \mid X)) = E(\delta(X)^2)$$

(since $\delta(X)$ is assumed to be the Bayes estimate of $\theta$ under SELF, so that $E(\theta \mid X) = \delta(X)$).
However, after interchanging the roles of $\theta$ and X, we have

$$E(\theta\delta(X)) = E(E(\theta\delta(X) \mid \theta)) = E(\theta E(\delta(X) \mid \theta)) = E(\theta^2)$$

(since $\delta(X)$ is assumed to be an unbiased estimator of $\theta$).

Thus,

$$E(\theta - \delta(X))^2 = E(\theta^2) - 2E(\theta\delta(X)) + E(\delta(X)^2) = E(\theta\delta(X)) - 2E(\theta\delta(X)) + E(\theta\delta(X)) = 0$$

and also

$$E(\theta - \delta(X))^2 = E\left(E\left((\theta - X)^2 \mid \theta\right)\right) = E(1) = 1.$$

This contradiction implies that the assumption that a Bayes estimator $\delta(x)$ is also an unbiased estimator cannot be true.

Remark 6.22. It is interesting to note that if we decide to choose the prior for $\theta$ as $N(0, \sigma^2)$, the posterior distribution is $N\left(\frac{x\sigma^2}{1+\sigma^2}, \frac{\sigma^2}{1+\sigma^2}\right)$. Under the SELF, the Bayes estimate of $\theta$ is $\delta_\sigma(X) = \frac{X\sigma^2}{1+\sigma^2}$, having a Bayes risk

$$E\left(E\left((\theta - \delta_\sigma(X))^2 \mid X\right)\right) = E\left(\frac{\sigma^2}{1+\sigma^2}\right) = \frac{\sigma^2}{1+\sigma^2}.$$

As $\sigma \to \infty$, $\delta_\sigma(X) \to \delta(X) = X$. The normal prior distribution, as $\sigma \to \infty$, approaches an improper (flat) prior. Hence, even though $\delta(X) = X$ is not a Bayes rule, it may be considered as almost a Bayes rule. This idea is formalized in the form of a generalised Bayes rule.


6.3. GENERALISED BAYES RULE 


Definition 6.10. If g is an improper prior but $\delta_g(x)$ is a decision rule which minimizes $\rho(g(\theta \mid x), a)$, for each x with m(x) > 0, then $\delta_g$ is called a generalised Bayes rule.

Example 6.23. Suppose $X \sim N(\theta, 1)$ and the loss function is $L(\theta, a) = (\theta - a)^2$; then the Bayes rule is $\delta(x) = E(\theta \mid x)$. However, there is no proper prior distribution such that $E(\theta \mid x) = x$ for all x. In case $g(\theta) \propto 1$, $\theta \in (-\infty, \infty)$, which is an improper prior, $\delta_g(x) = x$ will be a generalised Bayes estimate of $\theta$.

Example 6.24. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from $N(0, \sigma^2)$. If we choose $g(\sigma^2) \propto (\sigma^2)^r$, then the prior is improper for all r > 0. The posterior distribution

$$g(\sigma^2 \mid x) \propto (\sigma^2)^r (\sigma^2)^{-n/2} \exp\left(-\sum x_i^2 / 2\sigma^2\right)$$

is proper for all x if r < (n - 2)/2. The posterior expected loss exists if r < (n - 6)/2. Thus the generalised Bayes estimate of $\sigma^2$, under SELF, is $E(\sigma^2 \mid x) = \sum x_i^2/(n - 2r - 4)$.

Example 6.25. Let X ~ Gamma($\alpha$, $\theta$), where $\alpha$ is known. Suppose the prior distribution of $\theta$ is $g(\theta) \propto 1/\theta$, which is Jeffreys’ non-informative prior and is also improper. Since the posterior distribution of $\theta$ is Gamma($\alpha$, x), the generalised Bayes estimate of $\theta$ under SELF is $E(\theta \mid x) = \alpha/x$.

Remark 6.23. The distinction between regular and generalised Bayes estimators is important since the 
former are admissible and also unique when the loss function is convex. 

Remark 6.24. We may note that the notion of prior distribution is extended to include non-finite pdfs on the parameter space $\Theta$. When this is done, it is no longer easy to keep a probabilistic interpretation of the analysis. In particular, the marginal distribution of X may have infinite mass. Even then, by analogy, we may seek a function $\delta(X)$ which minimizes $\int_\Theta L(\theta, \delta) f(x \mid \theta) g(\theta)\, d\theta$, where $g(\theta)$ is an improper prior distribution of $\theta$, as long as the integral is finite. An improper prior distribution is sometimes called a generalised prior distribution and, therefore, the corresponding Bayes rule is called a generalised Bayes rule.


6.4 BILINEAR LOSS FUNCTION


Bilinear loss function is defined as

$$L(\theta, a) = \begin{cases} u(\theta - a) & \text{if } \theta > a \\ v(a - \theta) & \text{otherwise,} \end{cases} \tag{6.16}$$

where u and v are positive real numbers. The function increases more slowly than the SELF. Therefore, while remaining convex, it does not over-penalise large but unlikely errors. The bilinear loss function is an asymmetric loss function which is useful in representing unequal losses for over- and under-estimation.

Result 6.7. The Bayes estimate associated with a prior g and the bilinear loss function is the $\left(\frac{u}{u+v}\right)$ fractile of $g(\theta \mid x)$.
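Proof. The Bayes estimate a of $\theta$, under the bilinear loss function, is the solution of

$$\frac{d}{da} E(L(\theta, a)) = 0.$$

Since

$$E(L(\theta, a)) = u\int_a^\infty (\theta - a) g(\theta \mid x)\, d\theta + v\int_{-\infty}^a (a - \theta) g(\theta \mid x)\, d\theta,$$

on using Leibnitz’s formula for differentiation under the integral sign

$$\frac{d}{d\theta}\int_{a(\theta)}^{b(\theta)} f(x, \theta)\, dx = \int_{a(\theta)}^{b(\theta)} \frac{\partial}{\partial\theta} f(x, \theta)\, dx + f(b(\theta), \theta)\frac{db(\theta)}{d\theta} - f(a(\theta), \theta)\frac{da(\theta)}{d\theta}, \tag{6.17}$$

we have

$$\frac{d}{da}\, u\int_a^\infty (\theta - a) g(\theta \mid x)\, d\theta = -u\int_a^\infty g(\theta \mid x)\, d\theta$$

and

$$\frac{d}{da}\, v\int_{-\infty}^a (a - \theta) g(\theta \mid x)\, d\theta = v\int_{-\infty}^a g(\theta \mid x)\, d\theta.$$

Thus,

$$\frac{d}{da} E(L(\theta, a)) = -u\int_a^\infty g(\theta \mid x)\, d\theta + v\int_{-\infty}^a g(\theta \mid x)\, d\theta$$
$$= -u\left(1 - \int_{-\infty}^a g(\theta \mid x)\, d\theta\right) + v\int_{-\infty}^a g(\theta \mid x)\, d\theta$$
$$= -u + (u + v)\int_{-\infty}^a g(\theta \mid x)\, d\theta.$$

Hence the Bayes estimate of $\theta$ is given by

$$\int_{-\infty}^a g(\theta \mid x)\, d\theta = \frac{u}{u+v}.$$

Thus, a is the $\left(\frac{u}{u+v}\right)$ fractile of $g(\theta \mid x)$.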


Remark 6.25. For u = v, the bilinear loss reduces to (a multiple of) the absolute error loss and the Bayes estimate is the posterior median.

Example 6.26. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from $N(\theta, r)$, r known, and the prior for $\theta$ is $N(\mu, \tau)$. The Bayes estimate of $\theta$, under absolute error loss, is the median of the posterior distribution $N\left(\frac{\tau\mu + nr\bar{x}}{\tau + nr},\ \tau + nr\right)$. Thus, the Bayes estimate of $\theta$ is $\frac{\tau\mu + nr\bar{x}}{\tau + nr}$.


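Example 6.27. (Ferguson, 1967) Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from the Pareto density

$$f(x \mid \theta) = \frac{k\theta^k}{x^{k+1}} I_{(\theta, \infty)}(x), \quad \theta > 0,\ k > 0 \text{ known}.$$

Using the conjugate prior for the parameter $\theta$, find the Bayes estimate of $\theta$ when the loss function is

(i) $L(\theta, a) = \left(\frac{a}{\theta} - 1\right)^2$, n > 3
(ii) $L(\theta, a) = |\log a - \log\theta|$
(iii) $L(\theta, a) = \left|\frac{a}{\theta} - 1\right|$, n > 2.

Solution. The likelihood function of $\theta$, given x, is

$$\ell(\theta \mid x) = k^n \theta^{nk} \exp\left[-(k+1)\sum_{i=1}^n \log x_i\right] I_{(0, x_{(1)})}(\theta).$$

Since $x_{(1)} = \min(x_1, x_2, \ldots, x_n)$ is the sufficient statistic for $\theta$, the conjugate prior for $\theta$ is

$$g(\theta) = \frac{\beta}{m^{\beta}}\, \theta^{\beta - 1} I_{(0, m)}(\theta).$$

The posterior pdf for $\theta$ is

$$g(\theta \mid x) = \frac{n + \beta}{m_1^{n+\beta}}\, \theta^{n+\beta-1} I_{(0, m_1)}(\theta), \quad m_1 = \min(x_{(1)}, m)$$

(written for k = 1; for general k, n is replaced by nk throughout what follows).

(i) The Bayes estimate under $L(\theta, a) = \left(\frac{a}{\theta} - 1\right)^2$ is

$$a = \frac{E(\theta^{-1} \mid x)}{E(\theta^{-2} \mid x)}.$$

Since $E(\theta^{-1} \mid x) = \frac{n+\beta}{m_1(n+\beta-1)}$ and $E(\theta^{-2} \mid x) = \frac{n+\beta}{m_1^2(n+\beta-2)}$, hence

$$a = \frac{m_1(n+\beta-2)}{n+\beta-1}.$$

(ii) The Bayes estimate under the loss $L(\theta, a) = |\log a - \log\theta|$ is obtained by minimising the posterior expected loss. Since

$$\frac{d}{da}\left[\int_0^a (\log a - \log\theta) g(\theta \mid x)\, d\theta + \int_a^{m_1} (\log\theta - \log a) g(\theta \mid x)\, d\theta\right] = 0$$

gives

$$\int_0^a g(\theta \mid x)\, d\theta = \int_a^{m_1} g(\theta \mid x)\, d\theta,$$

a is the median (M) of the posterior distribution. Since the median cannot lie outside the range $(0, m_1)$ of $\theta$, we must have $m_1 > M$. Therefore, the Bayes estimate, which is the posterior median, is given by

$$\int_0^M \frac{n+\beta}{m_1^{n+\beta}}\, \theta^{n+\beta-1}\, d\theta = \frac{1}{2}.$$

Thus the Bayes estimate of $\theta$ is

$$M = m_1\left(\frac{1}{2}\right)^{1/(n+\beta)}.$$

(iii) The Bayes estimate of $\theta$ under the loss $L(\theta, a) = \left|\frac{a}{\theta} - 1\right|$ is given by the solution of

$$\frac{d}{da}\left[\int_0^a \left(\frac{a}{\theta} - 1\right) g(\theta \mid x)\, d\theta + \int_a^{m_1} \left(1 - \frac{a}{\theta}\right) g(\theta \mid x)\, d\theta\right] = 0$$

or

$$\int_0^a \frac{1}{\theta}\, g(\theta \mid x)\, d\theta = \int_a^{m_1} \frac{1}{\theta}\, g(\theta \mid x)\, d\theta$$

or

$$\int_0^a \frac{1}{\theta}\, g(\theta \mid x)\, d\theta = \frac{1}{2} E(\theta^{-1} \mid x).$$

Since

$$\int_0^a \frac{1}{\theta}\, \frac{n+\beta}{m_1^{n+\beta}}\, \theta^{n+\beta-1}\, d\theta = \frac{n+\beta}{m_1^{n+\beta}}\, \frac{a^{n+\beta-1}}{n+\beta-1},$$

hence solving

$$\frac{n+\beta}{m_1(n+\beta-1)} = \frac{2(n+\beta)}{m_1^{n+\beta}}\, \frac{a^{n+\beta-1}}{n+\beta-1}$$

for a we get the Bayes estimate of $\theta$ as $a = m_1\left(\frac{1}{2}\right)^{1/(n+\beta-1)}$.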


6.5 LINEX LOSS


The linex loss function is an asymmetric loss function, which was introduced by Klebanov (1972) and used by Varian (1975) in the context of real estate assessment. Zellner (1986) used it for estimation of a scalar parameter and prediction of a scalar random variable. Both Zellner (1986) and Varian (1975) have discussed its behaviour and various applications. The linex loss function is defined as

$$L(\theta, a) = \exp(c(a - \theta)) - c(a - \theta) - 1, \tag{6.18}$$

where $c \neq 0$. The constant c determines the shape of the loss function. In particular, for c > 0, $L(\theta, a)$ increases almost linearly for negative error and almost exponentially for positive error. This is why it is called LinEx (Linear Exponential). Hence over-estimation is considered to be a more serious mistake than under-estimation. When c < 0, the linear and exponential increases are interchanged, resulting in under-estimation being more serious than over-estimation. For small values of |c|,

$$L(\theta, a) \approx \frac{c^2(a - \theta)^2}{2}. \tag{6.19}$$

Thus, linex is almost symmetric and not too different from a squared error loss function and, therefore, Bayes estimates and predictions based on linex loss are quite near to those obtained from SELF.
Remark 6.26. If $\Delta = a - \theta$ is a measure of discrepancy between the estimate a and the parameter $\theta$, consider a loss function

$$L(\Delta) = b e^{a\Delta} - c\Delta - b; \quad a, c \neq 0,\ b > 0.$$

This will be a valid loss function if
(i) L(0) = 0 and
(ii) it has a minimum at $\Delta = 0$, that is,

$$L'(0) = ab e^{a\Delta} - c\,\big|_{\Delta = 0} = 0 \quad \text{or} \quad ab = c.$$

Thus,

$$L(\Delta) = b\left(e^{a\Delta} - a\Delta - 1\right); \quad a \neq 0,\ b > 0,$$

will be zero at $\Delta = 0$ and its minimum occurs at $\Delta = 0$.
In this representation, the parameter b in $L(\Delta)$ serves as a scale parameter and the parameter a determines its shape.
Result 6.8. The Bayes estimate under the linex loss function (6.18) is

$$a = -\frac{1}{c}\log E\left(e^{-c\theta}\right), \text{ provided the expectation exists}, \tag{6.20}$$

where the expectation is taken with respect to the posterior distribution of $\theta$.

Proof. The Bayes estimate a of $\theta$ under the linex loss function is the solution of $\frac{d}{da} E(L(\theta, a)) = 0$. Since

$$E(L(\theta, a)) = e^{ca} E\left(e^{-c\theta}\right) - ca + cE(\theta) - 1,$$

we have

$$\frac{d}{da} E(L(\theta, a)) = c e^{ca} E\left(e^{-c\theta}\right) - c,$$

where the expectations are taken with respect to $g(\theta \mid x)$.

Hence, the Bayes estimate of $\theta$ is

$$a = -\frac{1}{c}\log E\left(e^{-c\theta}\right) = -\frac{1}{c}\log M_{\theta \mid x}(-c) = -\frac{1}{c} K_{\theta \mid x}(-c),$$

where $M_{\theta \mid x}(\cdot)$ and $K_{\theta \mid x}(\cdot)$ are the moment generating function and the cumulant generating function of $g(\theta \mid x)$, respectively.


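Example 6.28. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from $N(\theta, \sigma^2)$ where $\sigma$ is known. If the prior density of $\theta$ is $g(\theta) \propto$ constant, then the posterior density of $\theta$, given x, is $N(\bar{x}, \sigma^2/n)$ having mgf $M_\theta(t) = \exp\left(\bar{x}t + \frac{\sigma^2 t^2}{2n}\right)$. Hence, the Bayes estimate of $\theta$, under linex loss, is

$$a = -\frac{1}{c}\log\exp\left(-c\bar{x} + \frac{c^2\sigma^2}{2n}\right) = \bar{x} - \frac{c\sigma^2}{2n}.$$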
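The closed form of Example 6.28 can be checked against the posterior expected loss directly. A minimal sketch, assuming a normal posterior and using the normal mgf in closed form; the function names are illustrative.

```python
import math

def linex_estimate_normal(post_mean, post_var, c):
    """-(1/c) log M(-c) for a N(post_mean, post_var) posterior."""
    log_mgf = -c * post_mean + 0.5 * c * c * post_var
    return -log_mgf / c

def expected_linex_loss(a, post_mean, post_var, c):
    """E[exp(c(a - theta)) - c(a - theta) - 1] under the normal posterior."""
    log_mgf = -c * post_mean + 0.5 * c * c * post_var
    return math.exp(c * a + log_mgf) - c * (a - post_mean) - 1.0
```

For instance, with a sample mean of 2, $\sigma^2 = 4$, n = 10 and c = 1, the posterior is N(2, 0.4) and the estimate is 2 - 0.2 = 1.8, below the posterior mean because over-estimation is penalised more heavily when c > 0.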
Remark 6.27. If $\sigma^2$ is also not known then we may replace $\sigma^2$ by its unbiased estimate $s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$ to obtain the Bayes estimate as $\bar{x} - \frac{cs^2}{2n}$. (See Zellner (1986), page 448).

Remark 6.28. If in Example 6.28, we take the prior distribution of $\theta$ to be $N(0, \tau^2)$ instead of the non-informative prior, then the posterior distribution of $\theta$, given x and $\sigma^2$, is $N\left(\frac{\bar{x}}{1+\lambda}, \frac{\sigma^2}{n(1+\lambda)}\right)$, where $\lambda = \frac{\sigma^2}{n\tau^2}$. The Bayes estimate of $\theta$, under linex, becomes

$$a_\tau = \frac{\bar{x}}{1+\lambda} - \frac{c\sigma^2}{2n(1+\lambda)} = \frac{1}{1+\lambda}\left(\bar{x} - \frac{c\sigma^2}{2n}\right).$$

Thus use of the conjugate prior for $\theta$ results in shrinkage of a towards zero (the prior mean).


Example 6.29. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from the Poisson distribution with unknown parameter $\theta$ and let the prior distribution of $\theta$ be Gamma($\alpha$, $\beta$). The posterior distribution of $\theta$, given x, is Gamma($\alpha + \sum x_i$, $\beta + n$). Since the mgf of the Gamma($\alpha$, $\beta$) distribution is $(1 - t/\beta)^{-\alpha}$, $|t/\beta| < 1$, the Bayes estimate under the linex loss is

$$a = -\frac{1}{c}\log\left(1 + \frac{c}{\beta + n}\right)^{-(\alpha + \sum x_i)} = \frac{\alpha + \sum x_i}{c}\log\left(1 + \frac{c}{\beta + n}\right).$$


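Example 6.30. Consider an observation X from the one-parameter exponential family

$$f(x \mid \theta) = \exp\left(x\theta - B(\theta) - M(x)\right), \quad -\infty < x < \infty.$$

For the natural conjugate prior

$$g(\theta \mid n_0, x_0) = \exp\left[n_0 x_0\theta - n_0 B(\theta) - K(n_0, x_0)\right],$$

the posterior distribution is

$$g(\theta \mid x, n_0, x_0) = \exp\left[(x + n_0 x_0)\theta - (n_0 + 1)B(\theta) - K\left(n_0 + 1, \frac{x + n_0 x_0}{n_0 + 1}\right)\right].$$

The Bayes estimate under the linex loss is

$$a = -\frac{1}{c}\log\int_{-\infty}^{\infty} e^{-c\theta} g(\theta \mid x)\, d\theta$$
$$= -\frac{1}{c}\log\int_{-\infty}^{\infty} \exp\left[(x + n_0 x_0 - c)\theta - (n_0 + 1)B(\theta) - K\left(n_0 + 1, \frac{x + n_0 x_0}{n_0 + 1}\right)\right] d\theta$$
$$= -\frac{1}{c}\log\exp\left[K\left(n_0 + 1, \frac{x - c + n_0 x_0}{n_0 + 1}\right) - K\left(n_0 + 1, \frac{x + n_0 x_0}{n_0 + 1}\right)\right].$$

In particular, suppose $X \sim N(\theta, 1)$; then we have

$$B(\theta) = \frac{\theta^2}{2}, \quad M(x) = \frac{x^2}{2} + \frac{1}{2}\log 2\pi.$$

The conjugate prior is

$$g(\theta \mid n_0, x_0) = \exp\left[n_0 x_0\theta - \frac{n_0\theta^2}{2} - K(n_0, x_0)\right], \quad n_0 > 0,$$

where $K(n_0, x_0) = \frac{1}{2}\left[n_0 x_0^2 + \log\left(\frac{2\pi}{n_0}\right)\right]$.

The posterior density is

$$g(\theta \mid x, n_0, x_0) = \exp\left[(x + n_0 x_0)\theta - (n_0 + 1)B(\theta) - K\left(n_0 + 1, \frac{x + n_0 x_0}{n_0 + 1}\right)\right], \quad n_0 > 0,$$

where $K\left(n_0 + 1, \frac{x + n_0 x_0}{n_0 + 1}\right) = \frac{1}{2}\left[\frac{(x + n_0 x_0)^2}{n_0 + 1} + \log\left(\frac{2\pi}{n_0 + 1}\right)\right]$.

Hence

$$a = \frac{x + n_0 x_0 - c/2}{n_0 + 1}.$$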


Remark 6.29. Thompson and Basu (1996) generalised the linex loss to what they called ‘Squarex loss’, given by

$$L(\Delta) = e^{b\Delta} + c\Delta^2 - b\Delta - 1; \quad (b, c) > 0,\ \Delta = a - \theta,$$

with Bayes estimate a satisfying the following equation

$$a = a_{\text{linex}} + \frac{1}{b}\log\left[1 - \frac{2c}{b}\left(a - E(\theta \mid x)\right)\right],$$

where $a_{\text{linex}}$ is the Bayes estimate under the linex loss function.

Squarex loss behaves quadratically for large under-estimation errors and exponentially for large over-estimation errors and includes linex as a special case. They also point out that the squarex family of loss functions can potentially approximate the actual loss function better than the linex loss family.


Modified Linex Loss 


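According to Basu and Ebrahimi (1991), when the parameter $\theta$ is a scale parameter, we may take $\Delta = (a/\theta) - 1$, where a is an estimate of $\theta$. They, therefore, define the modified linex loss function

$$L(\theta, a) = b\left[\exp\left(c\left(\frac{a}{\theta} - 1\right)\right) - c\left(\frac{a}{\theta} - 1\right) - 1\right], \tag{6.21}$$

where b > 0 and $c \neq 0$. The Bayes estimate under modified linex loss is obtained by minimizing the posterior expected loss and it is given by solving the equation

$$E\left[\frac{1}{\theta}\exp\left(\frac{ca}{\theta}\right) \,\Big|\, x\right] = e^c E\left[\frac{1}{\theta} \,\Big|\, x\right], \tag{6.22}$$

provided that all the expectations exist and are finite.

Remark 6.30. It is easy to see that for a scale parameter $\theta$, the modified linex loss (6.21) reduces to $\frac{bc^2}{2}\left(\frac{a}{\theta} - 1\right)^2$, for small values of |c|.

Example 6.31. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from $N(0, \theta)$ and $g(\theta) \propto 1/\theta$, so that the posterior distribution of $\theta$, given x, is Inverted-Gamma$\left(n/2, \sum x_i^2/2\right)$. Under the modified linex loss function (6.21), the Bayes estimate of $\theta$ is given by solving the equation (6.22).

Since

$$E(\theta^{-1} \mid x) = \frac{n}{\sum x_i^2}$$

and

$$E\left[\frac{e^{ca/\theta}}{\theta} \,\Big|\, x\right] = \int_0^\infty \frac{e^{ca/\theta}}{\theta}\, \frac{\left(\sum x_i^2/2\right)^{n/2}}{\Gamma(n/2)}\, \theta^{-(n/2+1)} \exp\left(-\frac{\sum x_i^2}{2\theta}\right) d\theta$$
$$= \frac{\Gamma(n/2 + 1)}{\Gamma(n/2)}\, \frac{\left(\sum x_i^2/2\right)^{n/2}}{\left(\left(\sum x_i^2 - 2ca\right)/2\right)^{n/2+1}} = \frac{n\left(\sum x_i^2\right)^{n/2}}{\left(\sum x_i^2 - 2ca\right)^{n/2+1}},$$

the Bayes estimate of $\theta$, under modified linex loss, is

$$a = \frac{\sum x_i^2}{2c}\left(1 - e^{-2c/(n+2)}\right).$$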
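The closed-form solution of Example 6.31 can be checked by substituting it back into the estimating equation (6.22). The following is a minimal sketch using the two posterior expectations derived in the example; the function names and the data in the usage note are hypothetical.

```python
import math

def modified_linex_estimate(xs, c):
    """Closed form (sum x_i^2 / 2c) * (1 - exp(-2c / (n + 2)))."""
    n = len(xs)
    s = sum(x * x for x in xs)
    return s / (2.0 * c) * (1.0 - math.exp(-2.0 * c / (n + 2)))

def lhs_622(a, xs, c):
    """E(theta^{-1} exp(c a / theta) | x) for the Inverted-Gamma(n/2, sum x_i^2 / 2) posterior."""
    n = len(xs)
    s = sum(x * x for x in xs)
    return n * s ** (n / 2.0) / (s - 2.0 * c * a) ** (n / 2.0 + 1.0)

def rhs_622(xs, c):
    """e^c * E(theta^{-1} | x) = e^c * n / sum x_i^2."""
    n = len(xs)
    s = sum(x * x for x in xs)
    return math.exp(c) * n / s
```

For the made-up sample [1.0, -2.0, 0.5, 1.5] and c = 0.5, evaluating `lhs_622` at `modified_linex_estimate(xs, c)` should agree with `rhs_622(xs, c)` up to floating-point error.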


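Example 6.32. (Kuo and Dey, 1990) Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from Gamma(2, $\theta$) and the prior distribution of $\theta$ is Gamma($\alpha$, $\beta$). The posterior distribution of $\theta$ is Gamma($\alpha + n$, $\beta + \sum x_i$). Under the modified linex loss function, the Bayes estimate of $h(\theta)$ is given by

$$E\left[\frac{1}{h(\theta)}\exp\left(\frac{ca}{h(\theta)}\right) \,\Big|\, x\right] = e^c E\left[\frac{1}{h(\theta)} \,\Big|\, x\right], \tag{6.23}$$

where a is the estimate of $h(\theta)$.
In particular, if $h(\theta) = 1/\theta$, the Bayes estimate under modified linex loss is obtained by solving the equation

$$E\left(\theta e^{ca\theta} \mid x\right) = e^c E(\theta \mid x). \tag{6.24}$$

Since

$$E\left(\theta e^{ac\theta} \mid x\right) = \frac{\left(\beta + \sum x_i\right)^{\alpha+n}}{\Gamma(\alpha + n)} \int_0^\infty \theta^{\alpha+n} e^{-(\beta + \sum x_i - ac)\theta}\, d\theta = \frac{\left(\beta + \sum x_i\right)^{\alpha+n}\, \Gamma(\alpha + n + 1)}{\Gamma(\alpha + n)\left(\beta + \sum x_i - ac\right)^{\alpha+n+1}}$$

and

$$E(\theta \mid x) = \frac{\alpha + n}{\beta + \sum x_i},$$

we have

$$a = \frac{1}{c}\left(\beta + \sum x_i\right)\left[1 - e^{-c/(\alpha + n + 1)}\right].$$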
Cc 


Remark 6.31. H.V. Roberts (1975) commented that a ‘fear of legal action makes auditors prefer to make 
conservative estimates and ...’ this implies the use of an asymmetrical loss function. In fact Varian’s 
(1975) motivation came from concern for the potential losses incurred by state government as a result 
of erroneous assessments of private homes. 


Entropy Loss 


Calabria and Pulcini (1996) defined the generalized entropy loss function

$$L(\theta, a) = b\left[\left(\frac{a}{\theta}\right)^c - c\log\frac{a}{\theta} - 1\right], \quad c \neq 0, \tag{6.25}$$

as a valid alternative to the modified linex loss.
The Bayes estimate under the generalised entropy loss function is given by

$$a = \left[E(\theta^{-c} \mid x)\right]^{-1/c}, \tag{6.26}$$

provided $E(\theta^{-c} \mid x)$ exists and is finite.
When c > 0, a positive error ($a > \theta$) causes more serious consequences than a negative error.
In particular, for c = 1, we have the entropy loss function

$$L(\theta, a) = b\left[\frac{a}{\theta} - \log\frac{a}{\theta} - 1\right]. \tag{6.27}$$

However, if $\frac{a - \theta}{\theta} \approx 0$, we have $L(\theta, a) \approx \frac{b}{2}\left(\frac{a}{\theta} - 1\right)^2$. The Bayes estimate of $\theta$ under the entropy loss function is obtained by putting c = 1 in (6.26), which is the posterior harmonic mean.

For negative values of c, i.e., c = -u (say), the form of the generalised entropy loss function reduces to

$$L(\theta, a) = \left(\frac{\theta}{a}\right)^u - u\log\frac{\theta}{a} - 1. \tag{6.28}$$

In particular, for u = 1, $L(\theta, a) = \frac{\theta}{a} - \log\frac{\theta}{a} - 1$. The Bayes estimate works out to be the posterior arithmetic mean.


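Example 6.33. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from Pois($\theta$) and the prior distribution of $\theta$ is Gamma($\alpha$, $\beta$). The posterior distribution of $\theta$, given x, is Gamma($\alpha + \sum x_i$, $\beta + n$). Since

$$E(\theta^{-1} \mid x) = \int_0^\infty \frac{1}{\theta}\, \frac{(\beta + n)^{\alpha + \sum x_i}}{\Gamma(\alpha + \sum x_i)}\, e^{-(\beta + n)\theta}\, \theta^{\alpha + \sum x_i - 1}\, d\theta = \frac{\beta + n}{\alpha + \sum x_i - 1},$$

the Bayes estimate of $\theta$, under the entropy loss function, is

$$\frac{1}{E(\theta^{-1} \mid x)} = \frac{\alpha + \sum x_i - 1}{\beta + n},$$

which is the harmonic mean of the Gamma($\alpha + \sum x_i$, $\beta + n$) distribution.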
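For a Gamma posterior, the estimate (6.26) has a closed form for every admissible c, and the harmonic and arithmetic means appear as the cases c = 1 and c = -1. A minimal sketch, assuming a shape/rate parameterisation; the function name is illustrative.

```python
import math

def general_entropy_estimate(shape, rate, c):
    """[E(theta^{-c})]^{-1/c} for theta ~ Gamma(shape, rate); requires c != 0 and shape > c."""
    if c == 0 or shape <= c:
        raise ValueError("need c != 0 and shape > c")
    # E(theta^{-c}) = Gamma(shape - c) / Gamma(shape) * rate^c
    log_moment = math.lgamma(shape - c) - math.lgamma(shape) + c * math.log(rate)
    return math.exp(-log_moment / c)
```

With shape 12 and rate 5 (for example $\alpha + \sum x_i = 12$ and $\beta + n = 5$), c = 1 gives 11/5, the harmonic mean of Example 6.33, and c = -1 gives 12/5, the posterior arithmetic mean.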
Remark 6.32. The linex, modified linex, and (generalised) entropy loss functions can be obtained from the general expression

$$L(\Delta) = b e^{a\Delta} - c\Delta - b; \quad a, c \neq 0,\ b > 0, \tag{6.29}$$

by taking $\Delta = a - \theta$, $\Delta = \frac{a}{\theta} - 1$, and $\Delta = \log\frac{a}{\theta}$, respectively.

Remark 6.33. Calabria and Pulcini (1996) suggest that the value of the shape parameter appearing in the linex, modified linex, and general entropy loss functions may be chosen as follows:
(i) For the linex loss function, consider

$$\frac{L(\theta + \delta, \theta)}{L(\theta - \delta, \theta)} = r; \quad \delta > 0,$$

where r is a measure of the ratio of the loss for over-estimation and the loss for under-estimation (the error being of the same absolute value but with different sign).
(ii) When the loss function is either modified linex or general entropy, consider the ratio

$$\frac{L(\theta\delta, \theta)}{L(\theta/\delta, \theta)} = r; \quad \delta > 0,$$

where r is the value of the ratio of the loss for an over-estimation by $\delta$ times and the loss for an under-estimation by $1/\delta$ times.

They suggest that the investigator should decide about the value of r in advance and then accordingly search for the value of c which satisfies the above criteria.
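The search for c described in Remark 6.33 (i) can be sketched with a simple bisection. The code below is an illustration under stated assumptions: the ratio is taken as the linex loss at error $+\delta$ divided by the loss at error $-\delta$, it is assumed to increase with c (which the bracketing loop relies on), and the function names are hypothetical.

```python
import math

def linex(err, c):
    """Linex loss exp(c*err) - c*err - 1, computed stably with expm1."""
    return math.expm1(c * err) - c * err

def loss_ratio(c, delta):
    """Loss for over-estimation by delta divided by loss for under-estimation by delta."""
    return linex(delta, c) / linex(-delta, c)

def solve_shape(r, delta, tol=1e-10):
    """Find c with loss_ratio(c, delta) = r by bisection; r > 0, r != 1, delta > 0."""
    if r <= 0.0 or r == 1.0:
        raise ValueError("r must be positive and different from 1")
    if r < 1.0:
        # loss_ratio(-c, delta) = 1 / loss_ratio(c, delta), so solve the mirrored problem
        return -solve_shape(1.0 / r, delta, tol)
    lo, hi = 1e-8, 1.0
    while loss_ratio(hi, delta) < r:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if loss_ratio(mid, delta) < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, an investigator who regards over-estimation by one unit as twice as costly as under-estimation by one unit would call `solve_shape(2.0, 1.0)`, which returns a value of c slightly above 1.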


6.6 INTRINSIC LOSS FUNCTIONS 


In some situations, we may know neither the loss function nor the natural parametric form of the distribution. In such non-informative situations, we may consider loss functions that compare directly the distributions $f(\cdot \mid \theta)$ and $f(\cdot \mid a)$ associated with the true parameter $\theta$ and its estimate a. The loss functions of the form

$$L(\theta, a) = d\left(f(\cdot \mid \theta), f(\cdot \mid a)\right) \tag{6.30}$$

are parameterization free. Here $d(\cdot, \cdot)$ represents a distance function. In the literature, we find that Entropy and Hellinger distances are commonly employed to construct the loss functions.


Entropy Distance 


One of the usual distance functions for distributions is the entropy distance (ED)

$$d\left(f(\cdot \mid \theta), f(\cdot \mid a)\right) = E\left[\log\frac{f(x \mid \theta)}{f(x \mid a)}\right], \tag{6.31}$$

where the expectation is taken with respect to $f(x \mid \theta)$. It is also known as the Kullback-Leibler divergence measure which is, in fact, not the same as the Euclidean distance function.
Example 6.34. Suppose $f(x \mid \theta) = \theta^x(1-\theta)^{1-x}$, x = 0, 1. The ED loss function is

$$L(\theta, a) = E\left[\log\frac{f(x \mid \theta)}{f(x \mid a)}\right]$$
$$= E\left(X\log\theta + (1 - X)\log(1 - \theta)\right) - E\left(X\log a + (1 - X)\log(1 - a)\right)$$
$$= E(X)\log\frac{\theta}{a} + E(1 - X)\log\frac{1-\theta}{1-a}$$
$$= \theta\log\frac{\theta}{a} + (1 - \theta)\log\frac{1-\theta}{1-a}, \tag{6.32}$$

which is known as the “logarithmic loss function”.
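Example 6.35. Suppose the rv X has pdf

$$f(x \mid \theta) = \frac{e^{-\theta}\theta^x}{x!}, \quad x = 0, 1, \ldots;\ \theta > 0.$$

The ED loss is

$$L(\theta, a) = E\left[\log\frac{e^{-\theta}\theta^x}{e^{-a}a^x}\right] = \theta\left(\frac{a}{\theta} - \log\frac{a}{\theta} - 1\right), \tag{6.33}$$

which is the weighted entropy loss function for the Poisson mean $\theta$.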

Remark 6.34. The Bayes estimate under the loss function (6.32) works out to be the posterior mean. It may be recalled that the posterior mean is the Bayes estimate of $\theta$ under the SELF as well.
Remark 6.35. We shall come across a number of situations in which two or more different loss functions may result in the same Bayes estimate.

Example 6.36. Suppose $X \sim N(0, \sigma^2)$. The entropy distance loss for $\sigma^2$ is

$$L(\sigma^2, a) = E\left[\log\frac{f(x \mid \sigma^2)}{f(x \mid a)}\right] = \frac{1}{2}\log\frac{a}{\sigma^2} - \frac{E(X^2)}{2\sigma^2} + \frac{E(X^2)}{2a} = \frac{1}{2}\left(\frac{\sigma^2}{a} - \log\frac{\sigma^2}{a} - 1\right).$$

In this case, the Bayes estimate of $\sigma^2$ is the posterior mean, i.e., $E(\sigma^2 \mid x)$.


Remark 6.36. Entropy loss provides explicit estimators for the estimation of the natural parameter $\theta$ in the canonical form of the exponential family with pdf

$$f(x \mid \theta) = c(\theta) h(x) \exp(\theta x).$$

Example 6.37. Let us consider $X \sim N(0, \sigma^2)$. The natural exponential family representation for

$$f(x \mid \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-x^2/2\sigma^2\right)$$

is obtained by putting $y = -\frac{x^2}{2}$ and $\frac{1}{\sigma^2} = \theta$.

Thus, $\theta = 1/\sigma^2$ is the natural parameter. The loss function will now be given by

$$E\left[\log\frac{f(y \mid \theta)}{f(y \mid a)}\right] = E\left[\frac{1}{2}\log\theta + \theta y\right] - E\left[\frac{1}{2}\log a + ay\right]$$
$$= \frac{1}{2}\log\theta - \frac{1}{2}\log a + \theta\left(\frac{-1}{2\theta}\right) - a\left(\frac{-1}{2\theta}\right)$$
$$= \frac{1}{2}\left(\frac{a}{\theta} - \log\frac{a}{\theta} - 1\right), \tag{6.34}$$

where a is the estimate of the natural parameter $\theta\ (= 1/\sigma^2)$.

Since (6.34) is the entropy loss (6.27), the Bayes estimate of the natural parameter $\theta\ (= 1/\sigma^2)$, under entropy distance loss, is the posterior harmonic mean.
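Example 6.38. Suppose $X \sim N(\theta, 1)$. The ED loss function is given by

$$E\left[\log\frac{f(x \mid \theta)}{f(x \mid a)}\right] = E\left[-\frac{\theta^2}{2} + \theta x\right] - E\left[-\frac{a^2}{2} + ax\right] = \frac{1}{2}\left(\theta^2 + a^2 - 2a\theta\right) = \frac{1}{2}(\theta - a)^2. \tag{6.35}$$

It is not surprising since

$$f(x \mid \theta) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(x - \theta)^2}{2}\right) = c(\theta) h(x) \exp(\theta x)$$

for

$$h(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right) \quad \text{and} \quad c(\theta) = \exp\left(-\frac{\theta^2}{2}\right).$$

Therefore, $\theta$ is the natural parameter. The Bayes estimate of $\theta$ is the posterior mean.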
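

Hellinger Distance


The distance function

$$d\left(f(\cdot \mid \theta), f(\cdot \mid a)\right) = \frac{1}{2} E\left[\left(\sqrt{\frac{f(x \mid a)}{f(x \mid \theta)}} - 1\right)^2\right] \tag{6.36}$$

is known as the Hellinger distance (HD), where the expectation is taken with respect to $f(x \mid \theta)$.
Example 6.39. Suppose $X \sim N(\theta, 1)$. The HD loss is

$$L(\theta, a) = \frac{1}{2} E\left[\left(\exp\left\{-\frac{1}{4}\left(a^2 - \theta^2 - 2x(a - \theta)\right)\right\} - 1\right)^2\right]$$
$$= \frac{1}{2}\exp\left\{-\frac{1}{2}(a^2 - \theta^2)\right\} E\left[e^{x(a-\theta)}\right] + \frac{1}{2} - \exp\left\{-\frac{1}{4}(a^2 - \theta^2)\right\} E\left[e^{x(a-\theta)/2}\right]$$
$$= \frac{1}{2}\exp\left\{-\frac{1}{2}(a^2 - \theta^2) + (a - \theta)\theta + \frac{(a-\theta)^2}{2}\right\} + \frac{1}{2} - \exp\left\{-\frac{1}{4}(a^2 - \theta^2) + \frac{(a - \theta)\theta}{2} + \frac{(a-\theta)^2}{8}\right\}.$$

Hence,

$$L(\theta, a) = 1 - \exp\left(-\frac{(a - \theta)^2}{8}\right). \tag{6.37}$$

The Bayes estimate of $\theta$ is given by solving

$$\frac{d}{da} E\left[1 - \exp\left(-\frac{(a - \theta)^2}{8}\right)\right] = 0$$

or

$$E\left[(\theta - a)\exp\left(-\frac{(\theta - a)^2}{8}\right)\right] = 0$$

or

$$a\, E\left[\exp\left(-\frac{(\theta - a)^2}{8}\right)\right] = E\left[\theta\exp\left(-\frac{(\theta - a)^2}{8}\right)\right], \tag{6.38}$$

where the expectations are taken with respect to the posterior density of $\theta$. Since the posterior distribution of $\theta$, for a random sample of size n with mean $\bar{x}$, is $N\left(\frac{n\bar{x}}{n+1}, \frac{1}{n+1}\right)$ when $\theta \sim N(0, 1)$, we have

$$E\left[\exp\left(-\frac{(\theta - a)^2}{8}\right)\right] = \sqrt{\frac{n+1}{2\pi}} \int \exp\left(-\frac{(\theta - a)^2}{8}\right) \exp\left\{-\frac{n+1}{2}\left(\theta - \frac{n\bar{x}}{n+1}\right)^2\right\} d\theta$$
$$= \sqrt{\frac{4(n+1)}{4n+5}}\, \exp\left\{-\frac{n+1}{2(4n+5)}\left(a - \frac{n\bar{x}}{n+1}\right)^2\right\}$$

and

$$E\left[\theta\exp\left(-\frac{(\theta - a)^2}{8}\right)\right] = \sqrt{\frac{4(n+1)}{4n+5}}\, \left(\frac{a + 4n\bar{x}}{4n+5}\right) \exp\left\{-\frac{n+1}{2(4n+5)}\left(a - \frac{n\bar{x}}{n+1}\right)^2\right\}.$$

Thus, the Bayes estimate a is obtained by solving

$$\frac{a + 4n\bar{x}}{4n+5} = a;$$

we get

$$a = \frac{n\bar{x}}{n+1}.$$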


Remark 6.37. Note that the Bayes estimate is the posterior mean (median or mode). It may also be noted that the loss (6.37) is an increasing function of $|\theta - a|$ and $g(\theta \mid x)$, which is normal, is symmetric and unimodal. Thus by Exercise 38 (Berger, 1980, page 161), the Bayes estimate is the posterior mode.
Remark 6.38. Spiring (1993) introduced the reflected normal loss function of the form

$$L(\theta, a) = k\left[1 - \exp\left(-\frac{(\theta - a)^2}{2\gamma^2}\right)\right], \quad (\gamma, k) > 0, \tag{6.39}$$

which is obtained by using a simple transformation of the normal density function.

Laila Mohammadi (2003) suggested that we may obtain a variety of other bounded loss functions by considering simple transformations of density functions. For example, if we use the Gumbel density, we get

$$L(\theta, a) = k\left[e^{-1} - \exp\left\{-\exp\left(-\frac{a - \theta}{\beta}\right) - \frac{a - \theta}{\beta}\right\}\right], \tag{6.40}$$

where $\beta > 0$ and k > 0. Note that this loss is a monotone function of linex loss for c < 0. Thus, ignoring the factor $ke^{-1}$ and writing $c = -1/\beta$, we get

$$L(\theta, a) = 1 - \exp\left[-\exp\{c(a - \theta)\} + c(a - \theta) + 1\right]; \quad c \neq 0. \tag{6.41}$$

Remark 6.39. Schäbe (1991) defined the quasi-quadratic loss function as

$$L(\theta, a) = \left(e^{ca} - e^{c\theta}\right)^2; \quad c \neq 0. \tag{6.42}$$

It can be easily seen that the Bayes estimate of $\theta$ under the quasi-quadratic loss function works out to be

$$a = \frac{1}{c}\log\left[E\left(e^{c\theta} \mid x\right)\right], \tag{6.43}$$

which is also a Bayes estimate under Varian’s linex loss. This loss function cannot be expressed in terms of $(a - \theta)$ or $(a - \theta)/\theta$.


6.7 BALANCED LOSS FUNCTION 


A balanced loss function (BLF) was formulated by Zellner (1994) to reflect the trade-off between
goodness of fit (or lack of bias) and precision of estimation. Suppose $X_1, X_2, \ldots, X_n$ is a random
sample satisfying the relationship $X_i = \theta + u_i$, $i = 1, 2, \ldots, n$, where $\theta$ is the common mean of the $X_i$'s
and $u_i$ is the error term. A balanced loss function for $\theta$, denoted by $L(\theta, a)$, where $a$ is some estimate
of $\theta$, is given by

$$L(\theta, a) = \frac{\omega}{n}\sum_{i=1}^{n}(x_i - a)^2 + (1 - \omega)(a - \theta)^2; \quad 0 \le \omega \le 1. \tag{6.44}$$

The first term on the right-hand side of (6.44) represents goodness of fit, while the second represents
precision of estimation.


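Result 6.9. The Bayes estimate of $\theta$, under BLF (6.44), is given by $a^* = \omega\bar{x} + (1 - \omega)\hat{\theta}$, where
$\hat{\theta}$ is the posterior mean.
Proof. Let us write

$$L(\theta, a) = \frac{\omega}{n}\sum_{i=1}^{n}(x_i - a)^2 + (1 - \omega)(a - \theta)^2 = \omega\left(\hat{\sigma}^2 + (a - \bar{x})^2\right) + (1 - \omega)(a - \theta)^2,$$

where $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$.
Assuming that the posterior distribution of $\theta$ has a finite second order moment, the posterior
expected loss under BLF is

$$E(L(\theta, a)) = E\left[\omega\left(\hat{\sigma}^2 + (a - \bar{x})^2\right) + (1 - \omega)(a - \theta)^2\right] = \omega\hat{\sigma}^2 + \omega(a - \bar{x})^2 + (1 - \omega)E(a - \theta)^2,$$

the expectations being taken with respect to the posterior distribution. Since

$$E(\theta - a)^2 = (a - \hat{\theta})^2 + E(\theta - \hat{\theta})^2,$$

we have

$$E[L(\theta, a)] = \omega\left(\hat{\sigma}^2 + (a - \bar{x})^2\right) + (1 - \omega)\left((a - \hat{\theta})^2 + E(\theta - \hat{\theta})^2\right)$$
$$= (a - a^*)^2 + \omega(1 - \omega)(\hat{\theta} - \bar{x})^2 + \omega\hat{\sigma}^2 + (1 - \omega)E(\theta - \hat{\theta})^2,$$

where $a^* = \omega\bar{x} + (1 - \omega)\hat{\theta}$.
Thus, the posterior expected loss is minimised when $a = a^*$.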
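
Result 6.9 can be checked numerically. The following is a minimal sketch, not part of the original text: it approximates the posterior by simulated draws (the data values, the normal posterior and the names `blf_posterior_expected_loss` and `blf_estimate` are illustrative assumptions) and confirms by grid search that the posterior expected BLF is smallest at the convex combination of the sample mean and the posterior mean.

```python
import random


def blf_posterior_expected_loss(a, xs, theta_draws, omega):
    """Posterior expected balanced loss (6.44), averaged over posterior draws."""
    n = len(xs)
    goodness_of_fit = sum((x - a) ** 2 for x in xs) / n
    precision = sum((a - t) ** 2 for t in theta_draws) / len(theta_draws)
    return omega * goodness_of_fit + (1 - omega) * precision


def blf_estimate(xs, theta_draws, omega):
    """Closed-form minimiser: omega * sample mean + (1 - omega) * posterior mean."""
    xbar = sum(xs) / len(xs)
    posterior_mean = sum(theta_draws) / len(theta_draws)
    return omega * xbar + (1 - omega) * posterior_mean


random.seed(0)
xs = [2.1, 1.7, 3.0, 2.4, 2.8]                               # hypothetical data
theta_draws = [random.gauss(2.0, 0.5) for _ in range(2000)]  # hypothetical posterior draws
omega = 0.3

a_star = blf_estimate(xs, theta_draws, omega)

# Grid search around a_star; the grid contains a_star itself (k = 0).
grid = [a_star + k * 0.01 for k in range(-200, 201)]
best = min(grid, key=lambda a: blf_posterior_expected_loss(a, xs, theta_draws, omega))
print(a_star, best)
```

Because the expected loss is quadratic in $a$, any other grid point gives a strictly larger value, so the search returns the closed-form estimate.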


Remark 6.40. If $\omega = 1$, then $a^* = \bar{x}$, and if $\omega = 0$ then $a^* = \hat{\theta}$. In particular, if $\bar{x}$ happens to be the
posterior mean, then $a^* = \bar{x} = \hat{\theta}$ (this happens when we consider a non-informative prior for the
normal mean).


Remark 6.41. Let us rewrite $a^* = \omega(\bar{x} - \hat{\theta}) + \hat{\theta}$. If $\bar{x} >$ or $< \hat{\theta}$, then $a^* >$ or $< \hat{\theta}$.
Remark 6.42. If the posterior mean is a linear function of $\bar{x}$, that is, $\hat{\theta} = c\bar{x} + (1 - c)\theta_0$, where $0 < c < 1$
and $\theta_0$ is the prior mean, then

$$a^* = \omega\bar{x} + (1 - \omega)c\bar{x} + (1 - \omega)(1 - c)\theta_0 = (\omega + (1 - \omega)c)\bar{x} + (1 - \omega)(1 - c)\theta_0.$$

Note that, for $0 < \omega < 1$, the sample mean has weight $\omega + (1 - \omega)c$ in the BLF estimate, whereas the sample
mean has weight $c$ when we consider the Bayes estimate as the posterior mean (under SELF). Since
$\omega + (1 - \omega)c > c$, the BLF estimate gives more weight to the sample information than that given
by the posterior mean as the estimate.

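Remark 6.43. The BLF estimate has a smaller posterior expected loss than the posterior mean as the
estimate, since

$$E(L(\theta, \hat{\theta})) - E(L(\theta, a^*)) = (\hat{\theta} - a^*)^2 = (\hat{\theta} - \omega\bar{x} - (1 - \omega)\hat{\theta})^2 = \omega^2(\bar{x} - \hat{\theta})^2 \ge 0.$$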




So, the BLF estimate $a^*$ also has a smaller risk than that of the posterior mean.

The same result holds for the sample mean.

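Example 6.40. (Chung, Kim and Song, 1998) Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a Poisson
distribution with unknown mean $\theta$. Consider the prior for $\theta$ to be Gamma($\alpha$, $\beta$). Then the posterior
distribution for $\theta$ is Gamma($\alpha + \sum x_i$, $\beta + n$). The Bayes estimate of $\theta$, under BLF, is

$$a^* = \omega\bar{x} + (1 - \omega)E(\theta\,|\,x) = \omega\bar{x} + (1 - \omega)\frac{\alpha + \sum x_i}{\beta + n} = \frac{n + \beta\omega}{\beta + n}\,\bar{x} + \frac{(1 - \omega)\alpha}{\beta + n}.$$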


Weighted Balanced Loss Function (WBLF) 


The weighted balanced loss function, suggested by Zellner and Rodrigue (1994), is an extension
of the BLF that reflects both goodness of fit and precision. It is given as

$$L(\theta, a) = q(\theta)\left[\frac{\omega}{n}\sum_{i=1}^{n}(x_i - a)^2 + (1 - \omega)(\theta - a)^2\right], \tag{6.45}$$

where $0 \le \omega \le 1$ and $q(\theta)$ is any positive weight function of $\theta$. WBLF reduces to BLF for $q(\theta) = 1$.
Result 6.10. The Bayes estimate of $\theta$, under WBLF, is given by

$$a^* = \omega\bar{x} + (1 - \omega)\frac{E(\theta q(\theta)\,|\,x)}{E(q(\theta)\,|\,x)}.$$

Proof. The Bayes estimate is given by the solution of the equation

$$\frac{\partial}{\partial a}E[L(\theta, a)\,|\,x] = 0.$$
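Remark 6.44. Note that the Bayes estimate retains the linearity property of the weighted BLF. In other
words, if $L_1(\theta, a_1) = q(\theta)\sum_{i=1}^{n}(x_i - a_1)^2/n$ and $L_2(\theta, a_2) = q(\theta)(\theta - a_2)^2$, with $a_1$ and $a_2$ the
corresponding Bayes estimates, then the Bayes estimate under

$$L(\theta, a) = \omega L_1(\theta, a) + (1 - \omega)L_2(\theta, a)$$

is

$$a^* = \omega a_1 + (1 - \omega)a_2.$$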
Example 6.41. (Example 6.40 continued) Following the suggestion of Shapiro and Wardrop (1967),
choose $q(\theta)$ to be Fisher's information, i.e., $q(\theta) = 1/\theta$. Since

$$a_1 = \bar{x} \quad \text{and} \quad a_2 = \frac{1}{E(1/\theta\,|\,x)} = \frac{\alpha + \sum x_i - 1}{\beta + n},$$

the Bayes estimate of $\theta$ under WBLF is

$$a^* = \omega\bar{x} + (1 - \omega)\frac{\alpha + n\bar{x} - 1}{\beta + n} = \frac{n + \beta\omega}{\beta + n}\,\bar{x} + \frac{(1 - \omega)(\alpha - 1)}{\beta + n}.$$
Remark 6.45. If we take Jeffreys' prior $g(\theta) \propto \theta^{-1/2}$ for $\theta$ in the above example, then the Bayes estimate
of $\theta$ under WBLF for $q(\theta) = 1/\theta$ is $a^* = \bar{x} - (1 - \omega)/2n$.
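Example 6.42. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from Bernoulli($\theta$), with $\theta$ unknown. Consider
the prior for $\theta$ to be Beta($\alpha$, $\beta$). Thus, the posterior distribution for $\theta$ is Beta($\alpha + \sum x_i$, $n - \sum x_i + \beta$). Consider
$q(\theta) = 1/\theta(1 - \theta)$, the Fisher's information for $\theta$. Since $a_1 = \bar{x}$ and $a_2 = (\alpha + \sum x_i - 1)/(\alpha + \beta + n - 2)$,
the Bayes estimate of $\theta$, under WBLF, is

$$a^* = \omega\bar{x} + (1 - \omega)\frac{\alpha + n\bar{x} - 1}{\alpha + \beta + n - 2} = \frac{(\alpha + \beta - 2)\omega + n}{\alpha + \beta + n - 2}\,\bar{x} + \frac{(1 - \omega)(\alpha - 1)}{\alpha + \beta + n - 2}.$$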
Remark 6.46. If we take Jeffreys' prior $g(\theta) \propto (\theta(1 - \theta))^{-1/2}$ for $\theta$ in Example 6.42, the Bayes estimate
of $\theta$ is

$$a^* = \frac{n - \omega}{n - 1}\,\bar{x} - \frac{1 - \omega}{2(n - 1)}.$$


6.8 MISCELLANEOUS EXAMPLES 


Estimation of reciprocal of a parameter 
Suppose $X_1, X_2, \ldots, X_n$ is a random sample from $N(\mu, \sigma^2)$, where $\sigma^2$ is known. We wish to estimate
$\theta = 1/\mu$. Consider the relative squared error loss function

$$L(\theta, a) = \left(\frac{\theta - a}{\theta}\right)^2,$$

where $a$ is an estimate of $\theta$. Note that this loss function treats a given absolute error $|\theta - a|$ as more serious when
the true value of $\theta$ is small than when it is large. The SELF, on the other
hand, assigns the same loss irrespective of the true value of $\theta$.

Relative SELF can be obtained by taking $\varepsilon = 1 - a\mu$, so that $\varepsilon^2 = (1 - a\mu)^2 = \left(\frac{\theta - a}{\theta}\right)^2$. Here the
quantity $\varepsilon$ measures the error when $a$ is substituted for $\theta$. On minimising the posterior expected loss,
the Bayes estimate $a^*$ of $\theta\,(= 1/\mu)$ is given by

$$a^* = \frac{E(\mu\,|\,x)}{E(\mu^2\,|\,x)} = \frac{1}{E(\mu\,|\,x)}\left(1 + \frac{Var(\mu\,|\,x)}{(E(\mu\,|\,x))^2}\right)^{-1}. \tag{6.46}$$

Note that the second factor on the right-hand side depends on the squared coefficient of variation of the
posterior pdf of $\mu$. If the posterior pdf for $\mu$ is highly concentrated around its mean, the Bayes estimate
$a^*$ will be close to $(E(\mu|x))^{-1}$. However, if the posterior coefficient of variation of $\mu$ is large, then
$|a^*| < \left|\frac{1}{E(\mu|x)}\right|$. Zellner (1978) called this estimation procedure minimum posterior expected loss
(MELO) estimation.
In particular, if the prior distribution of $\mu$ is Jeffreys' non-informative prior
$g(\mu) \propto$ constant, $-\infty < \mu < \infty$, the posterior pdf for $\mu$, $g(\mu|x, \sigma^2)$, is $N(\bar{x}, \sigma^2/n)$. Thus, the MELO estimate
of $\mu^{-1}$ is given by

$$a^* = \frac{1}{\bar{x}}\left(1 + \frac{\sigma^2}{n\bar{x}^2}\right)^{-1}.$$

Note that $1/\bar{x}$ is the maximum likelihood estimator for $\theta = 1/\mu$
and it is well known that the moments of $1/\bar{x}$ do not exist. Zellner (1978) has shown that the moments
of the MELO estimator are finite.
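
As a small illustration of (6.46), the sketch below computes the MELO estimate of $1/\mu$ from a posterior mean and variance and compares it with the plug-in value $1/\bar{x}$. The function name `melo_reciprocal` and the numerical inputs are hypothetical, chosen only to show the shrinkage.

```python
def melo_reciprocal(posterior_mean, posterior_var):
    """MELO estimate of 1/mu: E(mu|x) / E(mu^2|x), as in (6.46)."""
    second_moment = posterior_var + posterior_mean ** 2
    return posterior_mean / second_moment


# Flat prior on mu: the posterior is N(xbar, sigma^2 / n).
xbar, sigma2, n = 0.5, 4.0, 10          # hypothetical values
a_melo = melo_reciprocal(xbar, sigma2 / n)
a_plug_in = 1.0 / xbar                  # maximum likelihood estimate of 1/mu

# Equivalent form (1/xbar) * (1 + sigma^2 / (n * xbar^2))^(-1)
a_alt = (1.0 / xbar) / (1.0 + sigma2 / (n * xbar ** 2))
print(a_melo, a_alt, a_plug_in)
```

With a sample mean close to zero the plug-in value is large, while the MELO estimate is pulled towards zero by the factor involving the squared posterior coefficient of variation.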
Estimation of Coefficient of Variation 


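Example 6.43. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from $N(\mu, \sigma^2)$ and the joint prior distribution
of $\mu$ and $\sigma$ is non-informative, given by $g(\mu, \sigma) = 1/\sigma$. The joint posterior distribution is

$$g(\mu, \sigma\,|\,x) = \frac{1}{m(x)(2\pi)^{n/2}}\left(\frac{1}{\sigma}\right)^{n+1}\exp\left[-\frac{1}{2\sigma^2}\left(\sum(x_i - \bar{x})^2 + n(\mu - \bar{x})^2\right)\right], \quad \mu \in (-\infty, \infty),\ \sigma > 0,$$

where

$$m(x) = \int\!\!\int \ell(\mu, \sigma\,|\,x)\,g(\mu, \sigma)\,d\mu\,d\sigma = \frac{\Gamma\left(\frac{n-1}{2}\right)}{2\sqrt{n}\,(2\pi)^{(n-1)/2}\,(s^2/2)^{(n-1)/2}}, \quad s^2 = \sum(x_i - \bar{x})^2.$$

Our interest is in the estimation of the coefficient of variation $\sigma/\mu$.
(i) Zellner's MELO approach

If $\hat{\theta}$ is the MELO estimate of $\theta = \sigma/\mu$, then writing $\hat{\theta}\mu - \sigma = \varepsilon$, we should minimise the posterior
expected loss

$$E\left[(\hat{\theta}\mu - \sigma)^2\,|\,x\right] = E\left[\mu^2(\hat{\theta} - \theta)^2\,|\,x\right].$$

The MELO estimate is given by

$$\hat{\theta} = \frac{E(\theta\mu^2\,|\,x)}{E(\mu^2\,|\,x)} = \frac{E(\sigma\mu\,|\,x)}{E(\mu^2\,|\,x)}. \tag{6.47}$$

Since

$$E(\sigma\mu\,|\,x) = \int\!\!\int \mu\sigma\,g(\mu, \sigma\,|\,x)\,d\mu\,d\sigma = \bar{x}\,\frac{s}{\sqrt{2}}\,\Gamma\left(\frac{n}{2} - 1\right)\Big/\Gamma\left(\frac{n-1}{2}\right), \quad s^2 = \sum(x_i - \bar{x})^2,$$

and, further, the marginal posterior distribution of $\mu$ is a t-distribution with $(n - 1)$ df, mean $\bar{x}$, and variance
$\frac{\sum(x_i - \bar{x})^2}{n(n - 3)}$, we have the Bayes estimate of the coefficient of variation

$$\hat{\theta} = \frac{\hat{\sigma}}{\bar{x}}\sqrt{\frac{n}{2}}\,\frac{\Gamma\left(\frac{n}{2} - 1\right)}{\Gamma\left(\frac{n-1}{2}\right)}\left[1 + \frac{1}{n - 3}\left(\frac{\hat{\sigma}}{\bar{x}}\right)^2\right]^{-1}, \tag{6.48}$$

where $\hat{\sigma}^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$.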
n 


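(ii) Let us consider the entropy loss function

$$L(\theta, a_1) = \frac{a_1}{\theta} - \log\frac{a_1}{\theta} - 1, \quad \text{where } \theta = \frac{\sigma}{\mu}.$$

The Bayes estimate is the posterior harmonic mean, $a_1 = [E(\theta^{-1}|x)]^{-1}$. Since

$$E\left(\frac{\mu}{\sigma}\,\Big|\,x\right) = \int\!\!\int \frac{\mu}{\sigma}\,g(\mu, \sigma\,|\,x)\,d\mu\,d\sigma = \bar{x}\,\frac{\sqrt{2}}{s}\,\frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)},$$

the Bayes estimate under entropy loss is

$$a_1 = \frac{\hat{\sigma}}{\bar{x}}\sqrt{\frac{n}{2}}\,\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)}. \tag{6.49}$$

(iii) If the loss function is $L(\theta, a_2) = \left(\frac{a_2 - \theta}{\theta}\right)^2$, then the Bayes estimate is

$$a_2 = \frac{E(\theta^{-1}\,|\,x)}{E(\theta^{-2}\,|\,x)}.$$

Since

$$E\left(\frac{\mu^2}{\sigma^2}\,\Big|\,x\right) = \bar{x}^2\,\frac{n - 1}{s^2} + \frac{1}{n},$$

we get

$$a_2 = \frac{\hat{\sigma}}{\bar{x}}\sqrt{\frac{n}{2}}\,\frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n+1}{2}\right)}\left[1 + \frac{1}{n - 1}\left(\frac{\hat{\sigma}}{\bar{x}}\right)^2\right]^{-1}. \tag{6.50}$$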
2 


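Remark 6.47. If we use the approximation

$$\frac{\Gamma(c + d)}{\Gamma(c)} = c^d\left(1 + \frac{d(d - 1)}{2c} + O\left(\frac{1}{c^2}\right)\right),$$

the Bayes estimates of the coefficient of variation under the three loss functions are given by

$$\hat{\theta} = \frac{\hat{\sigma}}{\bar{x}}\sqrt{\frac{n}{n - 1}}\left(1 + \frac{3}{4(n - 1)} + O\left(\frac{1}{(n - 1)^2}\right)\right)\left[1 + \frac{1}{n - 3}\left(\frac{\hat{\sigma}}{\bar{x}}\right)^2\right]^{-1},$$

$$a_1 = \frac{\hat{\sigma}}{\bar{x}}\left(1 + \frac{3}{4n} + O\left(\frac{1}{n^2}\right)\right),$$

$$a_2 = \frac{\hat{\sigma}}{\bar{x}}\sqrt{\frac{n}{n + 1}}\left(1 + \frac{3}{4(n + 1)} + O\left(\frac{1}{(n + 1)^2}\right)\right)\left[1 + \frac{1}{n - 1}\left(\frac{\hat{\sigma}}{\bar{x}}\right)^2\right]^{-1}.$$

All three estimates tend to the classical maximum likelihood estimate, namely $\hat{\sigma}/\bar{x}$, of the
coefficient of variation as $n \to \infty$.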
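
The limiting behaviour can be illustrated numerically. The sketch below assumes the exact gamma-function forms of the three estimates, namely $\frac{\hat{\sigma}}{\bar{x}}\sqrt{n/2}\,\Gamma(\frac{n}{2} - 1)/\Gamma(\frac{n-1}{2})\,[1 + \frac{1}{n-3}(\hat{\sigma}/\bar{x})^2]^{-1}$ for MELO, $\frac{\hat{\sigma}}{\bar{x}}\sqrt{n/2}\,\Gamma(\frac{n-1}{2})/\Gamma(\frac{n}{2})$ for entropy loss, and $\frac{\hat{\sigma}}{\bar{x}}\sqrt{n/2}\,\Gamma(\frac{n}{2})/\Gamma(\frac{n+1}{2})\,[1 + \frac{1}{n-1}(\hat{\sigma}/\bar{x})^2]^{-1}$ for relative SELF. The function name `cv_estimates` and the input values are hypothetical.

```python
import math


def cv_estimates(sigma_hat, xbar, n):
    """Return (MELO, entropy-loss, relative-SELF) estimates of sigma/mu for n > 3."""
    cv = sigma_hat / xbar
    root = math.sqrt(n / 2.0)

    def gamma_ratio(num, den):
        # Gamma(num) / Gamma(den), computed on the log scale for stability
        return math.exp(math.lgamma(num) - math.lgamma(den))

    melo = cv * root * gamma_ratio(n / 2 - 1, (n - 1) / 2) / (1 + cv ** 2 / (n - 3))
    entropy = cv * root * gamma_ratio((n - 1) / 2, n / 2)
    relative = cv * root * gamma_ratio(n / 2, (n + 1) / 2) / (1 + cv ** 2 / (n - 1))
    return melo, entropy, relative


for n in (10, 100, 10000):
    print(n, cv_estimates(sigma_hat=1.0, xbar=5.0, n=n))
```

For the inputs shown the classical estimate is 0.2, and all three Bayes estimates move towards it as $n$ grows.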

Remark 6.48. The MELO estimator and the estimator under the relative SELF have a shrinking factor 
depending on the square of the classical estimator of coefficient of variation. 


Estimation of Binomial Parameter n 


Electrolux Company wishes to estimate the number of washing machines in use in a certain 
service area. Suppose further that the company believes that the weekly total of defective washing 
machines sent in for repair (irrespective of age) arises with a binomial probability p about whose value 
they have some prior knowledge. The number of defective washing machines X received during a 
routine week could be used to give an estimate of the population size n. In general, if we have a 
characteristic with binomial behaviour and only the successes (or failures) become apparent, we can 
use these to obtain information on the population size. 

The problem of estimating n has a long history, dating back to the works of Student, Fisher, and
Haldane. Olkin et al. (1981) showed that the maximum likelihood estimator as well as the method of
moments estimator were unstable. Feldman and Fox (1968) have reviewed the literature on estimation
of n based on M.L.E., M.V.U.E. and M.M.E. Carroll and Lombard (1985) used a beta prior distribution
for p to modify the estimates. Draper and Guttman (1971) proposed a Bayesian approach in which the
prior distribution of n was uniform on a set 1, 2, ..., N. Hamedani and Walter (1988) considered Bayes
estimation of the binomial parameter n based on a general prior distribution.

In this section we shall discuss the approach of Hamedani and Walter (1988) for estimation of n.


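Suppose $X = (X_1, X_2, \ldots, X_k)$ is a random sample from Bin(n, p) and let $x = (x_1, x_2, \ldots, x_k)$ be the
observed values.
Case 1. p known
Let the prior distribution of n be $g(n)$. The likelihood function of n, given
$x = (x_1, x_2, \ldots, x_k)$, is

$$\ell(n\,|\,x) = \prod_{i=1}^{k}\binom{n}{x_i}p^{x_i}(1 - p)^{n - x_i}, \quad x_i \le n,\ i = 1, \ldots, k.$$

Therefore the marginal pmf of x is

$$m(x) = \sum_{n = x_{(k)}}^{\infty}\ell(n\,|\,x)\,g(n),$$

where $x_{(k)}$ is the largest order statistic, and the posterior probability mass function of n is

$$g(n\,|\,x) = \left\{\prod_{i=1}^{k}\binom{n}{x_i}p^{x_i}(1 - p)^{n - x_i}\right\}g(n)\Big/m(x).$$

The Bayes estimate of n, under SELF, is given by the posterior mean

$$\hat{n} = E(n\,|\,x) = x_{(k)} + \sum_{n = x_{(k)}}^{\infty}(n - x_{(k)})\left\{\prod_{i=1}^{k}\binom{n}{x_i}p^{x_i}(1 - p)^{n - x_i}\right\}g(n)\Big/m(x)$$

$$= x_{(k)} + \sum_{n = x_{(k)}}^{\infty}(n - x_{(k)})\frac{n!}{(n - x_{(k)})!\,x_{(k)}!}\,p^{x_{(k)}}(1 - p)^{n - x_{(k)}}\left\{\prod_{i \ne (k)}\binom{n}{x_i}p^{x_i}(1 - p)^{n - x_i}\right\}g(n)\Big/m(x)$$

$$= x_{(k)} + (x_{(k)} + 1)\left(\frac{1 - p}{p}\right)\sum_{n = x_{(k)} + 1}^{\infty}\binom{n}{x_{(k)} + 1}p^{x_{(k)} + 1}(1 - p)^{n - x_{(k)} - 1}\left\{\prod_{i \ne (k)}\binom{n}{x_i}p^{x_i}(1 - p)^{n - x_i}\right\}g(n)\Big/m(x)$$

$$= x_{(k)} + (x_{(k)} + 1)\left(\frac{1 - p}{p}\right)m(x_{(1)}, \ldots, x_{(k-1)}, x_{(k)} + 1)\Big/m(x), \tag{6.52}$$

where $(x_{(1)}, x_{(2)}, \ldots, x_{(k)})$ is the value of the order statistics vector for the observed vector $(x_1, x_2, \ldots, x_k)$.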


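In particular, for a single observation, i.e., k = 1, we may substitute $x_{(k)} = x$ in (6.52) to get

$$\hat{n} = x + (x + 1)\left(\frac{1 - p}{p}\right)\frac{m(x + 1)}{m(x)}. \tag{6.53}$$

(a) Suppose $g(n) = \frac{e^{-\lambda}\lambda^n}{n!}$, n = 0, 1, 2, .... We have

$$m(x) = \sum_{n = x}^{\infty}\binom{n}{x}p^x(1 - p)^{n - x}\frac{e^{-\lambda}\lambda^n}{n!} = \frac{e^{-\lambda}(\lambda p)^x}{x!}\sum_{m = 0}^{\infty}\frac{[(1 - p)\lambda]^m}{m!} = e^{-\lambda p}\frac{(p\lambda)^x}{x!}; \quad x = 0, 1, 2, \ldots,$$

which is Poisson with parameter $\lambda p$.
Hence, the posterior distribution of n is

$$g(n\,|\,x) = \binom{n}{x}p^x(1 - p)^{n - x}\frac{e^{-\lambda}\lambda^n}{n!}\Big/\frac{e^{-\lambda p}(\lambda p)^x}{x!} = \frac{e^{-\lambda(1 - p)}[\lambda(1 - p)]^{n - x}}{(n - x)!}, \quad n = x, x + 1, \ldots$$

Thus, for fixed x, (n - x) has a Poisson distribution with parameter $\lambda(1 - p)$.
Under SELF, the Bayes estimate of n is given by

$$E(n - x\,|\,x) = \lambda(1 - p).$$

Hence, $\hat{n} = x + \lambda(1 - p)$.
Aliter: We may use (6.53) to obtain

$$\hat{n} = x + (x + 1)\left(\frac{1 - p}{p}\right)\frac{e^{-\lambda p}(\lambda p)^{x + 1}/(x + 1)!}{e^{-\lambda p}(\lambda p)^x/x!} = x + \lambda(1 - p).$$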
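
The two routes to $\hat{n} = x + \lambda(1 - p)$ can be compared numerically. The sketch below truncates the infinite sums at a hypothetical cut-off `n_max` (an assumption made only for computation) and evaluates both the direct posterior mean and the ratio form (6.53); the function names are illustrative.

```python
import math


def poisson_pmf(n, lam):
    """Poisson(lam) pmf at n, computed on the log scale."""
    return math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))


def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


def marginal(x, p, lam, n_max=150):
    """m(x) = sum over n >= x of Bin(x | n, p) * g(n), truncated at n_max."""
    return sum(binom_pmf(x, n, p) * poisson_pmf(n, lam) for n in range(x, n_max + 1))


def posterior_mean_direct(x, p, lam, n_max=150):
    num = sum(n * binom_pmf(x, n, p) * poisson_pmf(n, lam) for n in range(x, n_max + 1))
    return num / marginal(x, p, lam, n_max)


def posterior_mean_ratio(x, p, lam):
    """Formula (6.53): x + (x + 1) * ((1 - p) / p) * m(x + 1) / m(x)."""
    return x + (x + 1) * ((1 - p) / p) * marginal(x + 1, p, lam) / marginal(x, p, lam)


x, p, lam = 7, 0.3, 20.0                      # hypothetical values
closed_form = x + lam * (1 - p)
print(posterior_mean_direct(x, p, lam), posterior_mean_ratio(x, p, lam), closed_form)
```

All three numbers agree up to truncation error, since the neglected tail of the Poisson prior beyond `n_max` is negligible for these inputs.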


(b) However, if we take the prior distribution to be the discrete uniform over the set {1, 2, ...}, i.e.,
$g(n) = 1$ for all n, then on substituting

$$m(x) = \sum_{n = x}^{\infty}\binom{n}{x}p^x(1 - p)^{n - x} = \frac{p^x}{(1 - (1 - p))^{x + 1}} = \frac{1}{p}; \quad x = 0, 1, \ldots$$

and $m(x + 1) = 1/p$ in (6.53), we get

$$\hat{n} = \frac{x}{p} + \frac{1 - p}{p}.$$


Case 2. Both p and n unknown 
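Assume that p and n are a-priori independent of each other, with marginal priors for p and n
given by Beta($\alpha$, $\beta$) and Pois($\lambda$), respectively. Then

$$m(x) = \sum_{n = 0}^{\infty}\int_0^1 f(x\,|\,n, p)\,g(n, p)\,dp = \int_0^1\sum_{n = 0}^{\infty}f(x\,|\,n, p)\,g_1(n)\,g_2(p)\,dp = \int_0^1 m(x\,|\,p)\,g_2(p)\,dp.$$

Since $m(x|p)$ is $m(x)$ of Case 1, we have

$$m(x) = \int_0^1 e^{-\lambda p}\frac{(\lambda p)^x}{x!}\,\frac{p^{\alpha - 1}(1 - p)^{\beta - 1}}{B(\alpha, \beta)}\,dp = \frac{\lambda^x}{x!\,B(\alpha, \beta)}\int_0^1 e^{-\lambda p}p^{\alpha + x - 1}(1 - p)^{\beta - 1}\,dp$$

and

$$g(n, p\,|\,x) = \left[\binom{n}{x}p^x(1 - p)^{n - x}\,\frac{e^{-\lambda}\lambda^n}{n!}\,\frac{p^{\alpha - 1}(1 - p)^{\beta - 1}}{B(\alpha, \beta)}\right]\Big/m(x).$$

So,

$$g(n\,|\,x) = \left[\frac{e^{-\lambda}\lambda^n}{x!\,(n - x)!}\int_0^1\frac{p^{\alpha + x - 1}(1 - p)^{n - x + \beta - 1}}{B(\alpha, \beta)}\,dp\right]\Big/m(x)$$

$$= \frac{e^{-\lambda}\lambda^{n - x}\,B(\alpha + x, n - x + \beta)}{(n - x)!\int_0^1 e^{-\lambda p}p^{\alpha + x - 1}(1 - p)^{\beta - 1}\,dp}, \quad n = x, x + 1, \ldots$$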


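If $g(n) = 1$ for all n, $m(x\,|\,p) = \frac{1}{p}$, and

$$m(x) = \int_0^1 m(x\,|\,p)\,g_2(p)\,dp = \int_0^1\frac{1}{p}\,\frac{p^{\alpha - 1}(1 - p)^{\beta - 1}}{B(\alpha, \beta)}\,dp = \frac{\alpha + \beta - 1}{\alpha - 1}.$$

So,

$$g(n\,|\,x) = \left[\int_0^1\binom{n}{x}p^x(1 - p)^{n - x}\,\frac{p^{\alpha - 1}(1 - p)^{\beta - 1}}{B(\alpha, \beta)}\,dp\right]\Big/\left(\frac{\alpha + \beta - 1}{\alpha - 1}\right) = \binom{n}{x}\frac{B(x + \alpha, n - x + \beta)}{B(\alpha, \beta)}\left(\frac{\alpha - 1}{\alpha + \beta - 1}\right); \quad n = x, x + 1, \ldots,$$

and

$$g(p\,|\,x) = \left[\frac{p^{\alpha - 1}(1 - p)^{\beta - 1}}{B(\alpha, \beta)}\sum_{n = x}^{\infty}\binom{n}{x}p^x(1 - p)^{n - x}\right]\Big/\left(\frac{\alpha + \beta - 1}{\alpha - 1}\right) = \frac{\alpha - 1}{\alpha + \beta - 1}\,\frac{p^{\alpha - 1}(1 - p)^{\beta - 1}}{B(\alpha, \beta)}\left(\frac{1}{p}\right).$$

Hence, $\hat{n} = E_p E(n\,|\,x, p)$,
where $E_p$ is the expectation with respect to the marginal posterior distribution $g(p|x)$ of p, given x.
Thus, applying (6.53) conditionally on p,

$$\hat{n} = E_p\left[x + (x + 1)\left(\frac{1 - p}{p}\right)\frac{m(x + 1\,|\,p)}{m(x\,|\,p)}\right] = x + (x + 1)\int_0^1\left(\frac{1 - p}{p}\right)g(p\,|\,x)\,dp$$

$$= x + (x + 1)\,\frac{\alpha - 1}{\alpha + \beta - 1}\,\frac{B(\alpha - 2, \beta + 1)}{B(\alpha, \beta)} = x + (x + 1)\frac{\beta}{\alpha - 2} = \frac{\alpha + \beta - 2}{\alpha - 2}\,x + \frac{\beta}{\alpha - 2},$$

which requires $\alpha > 2$.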


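Iliopoulos (2003) revisited the problem of estimation of n (p known) when the parameter space
is more realistic. He considered the parameter space to be the set of positive natural numbers and took
the prior distribution of n to be

$$g(n) = \frac{e^{-\lambda}\lambda^{n - 1}}{(n - 1)!}, \quad n = 1, 2, \ldots$$

The Bayes estimates of n under the loss functions

(i) $L_1(n, a) = (a - n)^2/n$

(ii) $L_2(n, a) = (a - n)^2$

may be obtained as follows.

(i) The Bayes estimate of n under $L_1$ is the posterior harmonic mean, i.e., $\hat{n} = [E(n^{-1}\,|\,x)]^{-1}$.
Writing $q = 1 - p$, the posterior distribution of n for x = 0 is

$$g(n\,|\,x = 0) = \left[q^n e^{-\lambda}\lambda^{n - 1}/(n - 1)!\right]\Big/\sum_{n = 1}^{\infty}q^n e^{-\lambda}\lambda^{n - 1}/(n - 1)!$$

$$= \left[(\lambda q)^{n - 1}/(n - 1)!\right]\Big/\sum_{n = 1}^{\infty}(\lambda q)^{n - 1}/(n - 1)! = \frac{e^{-\lambda q}(\lambda q)^{n - 1}}{(n - 1)!}.$$

For $x \ge 1$,

$$g(n\,|\,x) = \left[\binom{n}{x}p^x q^{n - x}\frac{e^{-\lambda}\lambda^{n - 1}}{(n - 1)!}\right]\Big/\sum_{n = x}^{\infty}\left[\binom{n}{x}p^x q^{n - x}\frac{e^{-\lambda}\lambda^{n - 1}}{(n - 1)!}\right]$$

$$= \left(\frac{n}{x + \lambda q}\right)\frac{(\lambda q)^{n - x}e^{-\lambda q}}{(n - x)!}; \quad x \ge 1,\ n = x, x + 1, \ldots$$

Thus

$$g(n\,|\,x) = \begin{cases} e^{-\lambda q}(\lambda q)^{n - 1}/(n - 1)!, & n = 1, 2, \ldots;\ x = 0 \\ \dfrac{n\,e^{-\lambda q}(\lambda q)^{n - x}}{(n - x)!\,(x + \lambda q)}, & n = x, x + 1, \ldots;\ x \ge 1 \end{cases}$$

and

$$E\left(\frac{1}{n}\,\Big|\,x = 0\right) = \sum_{n = 1}^{\infty}e^{-\lambda q}(\lambda q)^{n - 1}/n! = \frac{1}{\lambda q}\sum_{n = 1}^{\infty}e^{-\lambda q}(\lambda q)^n/n! = \frac{1}{\lambda q}\left(1 - e^{-\lambda q}\right),$$

while for $x \ge 1$,

$$E\left(\frac{1}{n}\,\Big|\,x\right) = \frac{1}{x + \lambda q}\sum_{m = 0}^{\infty}\frac{e^{-\lambda q}(\lambda q)^m}{m!} = \frac{1}{x + \lambda q}.$$

Hence the Bayes estimate of n is

$$\hat{n} = \begin{cases} \lambda q\left(1 - e^{-\lambda q}\right)^{-1} & \text{if } x = 0 \\ x + \lambda q & \text{if } x \ge 1. \end{cases}$$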
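(ii) The Bayes estimate of n under SELF is

$$E(n\,|\,x = 0) = \sum_{n = 1}^{\infty}n\,\frac{e^{-\lambda q}(\lambda q)^{n - 1}}{(n - 1)!} = 1 + \lambda q$$

(since the posterior distribution of n - 1, given X = 0, is Pois($\lambda q$), so that n takes the values 1, 2, ...) and for $x \ge 1$,

$$E(n\,|\,x) = \sum_{n = x}^{\infty}\frac{n^2}{x + \lambda q}\,\frac{e^{-\lambda q}(\lambda q)^{n - x}}{(n - x)!} = \frac{1}{x + \lambda q}\sum_{m = 0}^{\infty}(x + m)^2\,\frac{e^{-\lambda q}(\lambda q)^m}{m!} \quad \text{for } m = n - x$$

$$= \frac{x^2 + 2x\lambda q + \lambda q + \lambda^2 q^2}{x + \lambda q} = x + \lambda q + \frac{\lambda q}{x + \lambda q}.$$

Hence, the Bayes estimate of n under SELF is

$$E(n\,|\,x) = \begin{cases} 1 + \lambda q & \text{if } x = 0 \\ x + \lambda q + \dfrac{\lambda q}{x + \lambda q} & \text{if } x \ge 1. \end{cases}$$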
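
The closed forms under both loss functions can be verified by building the posterior directly from the shifted Poisson prior $g(n) = e^{-\lambda}\lambda^{n-1}/(n-1)!$ and the binomial likelihood. This is a minimal sketch: the truncation point `n_max`, the function names and the parameter values are assumptions made for illustration.

```python
import math


def log_prior(n, lam):
    """Log of the shifted Poisson prior on n = 1, 2, ...; lgamma(n) = log((n-1)!)."""
    return -lam + (n - 1) * math.log(lam) - math.lgamma(n)


def posterior(x, p, lam, n_max=300):
    """Normalised posterior pmf of n on max(x, 1), ..., n_max."""
    q = 1.0 - p
    support = list(range(max(x, 1), n_max + 1))
    weights = [math.comb(n, x) * p ** x * q ** (n - x) * math.exp(log_prior(n, lam))
               for n in support]
    total = sum(weights)
    return [(n, w / total) for n, w in zip(support, weights)]


def harmonic_mean_estimate(x, p, lam):
    """Bayes estimate under L1(n, a) = (a - n)^2 / n."""
    return 1.0 / sum(w / n for n, w in posterior(x, p, lam))


def posterior_mean_estimate(x, p, lam):
    """Bayes estimate under squared error loss."""
    return sum(n * w for n, w in posterior(x, p, lam))


p, lam = 0.4, 5.0                      # hypothetical values, so lam * q = 3
lq = lam * (1 - p)
print(harmonic_mean_estimate(0, p, lam), lq / (1 - math.exp(-lq)))
print(harmonic_mean_estimate(3, p, lam), 3 + lq)
print(posterior_mean_estimate(0, p, lam), 1 + lq)
print(posterior_mean_estimate(3, p, lam), 3 + lq + lq / (3 + lq))
```

Each printed pair shows the numerical value next to the corresponding closed form.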


Capture-Recapture Problem 


The capture-recapture approach is often used to estimate the population size. This approach relies 
on taking two successive samples from the population of interest. It is also used to estimate population 
size of nomads, homeless people, or illegal immigrants besides the population of animals. Mosteller and 
Wallace (1964, 1984) applied the capture-recapture technique for author identification by linguistics when
the origin of some literary work is uncertain. Smith (1991) lists various applications in areas such as 
epidemiology and demography. 

The type of method used for estimation depends on the nature of population investigated, 
namely whether it is closed or open. Closed population is one that remains effectively unchanged 
during the period of investigation, while an open population is one that may change due to birth, death, 
or migration. Seber (1986, 1992) and Schwarz and Seber (1999) review the literature on estimation of 
animal population. A historical overview of the subject is given in Pollock (1991). 

A capture-recapture census involves catching and marking a sample of R 'items' and returning
them to the closed population of size N. The remaining S(=N-—R) items of the population remain 
unmarked. After allowing marked and unmarked to mix, a second representative simple random sample 
(without replacement) of n items is then drawn. This sample yields r captured marked items and 
s (=n—r) unmarked. 

Suppose that there is an unknown probability $\theta$ of capturing each item independently of each
other. Then the probability of capturing n items from a population of size N is given by the binomial
distribution

$$f_1(n\,|\,N, \theta) = \binom{R + S}{n}\theta^n(1 - \theta)^{R + S - n}, \quad 0 < \theta < 1,\ n = 0, 1, \ldots, R + S\,(= N). \tag{6.54}$$

The probability of recapturing r marked items out of n, conditional on R, N, $\theta$, and n, is given by the
hypergeometric distribution

$$f_2(r\,|\,R, S, \theta, n) = \binom{R}{r}\binom{S}{s}\binom{R + S}{r + s}^{-1}. \tag{6.55}$$

Since S and $\theta$ are unknown, the likelihood function of S, $\theta$ is

$$\ell(S, \theta\,|\,n, r, R) = f_1(n\,|\,N, \theta)\,f_2(r\,|\,R, S, \theta, n) = \binom{R}{r}\binom{S}{s}\theta^{r + s}(1 - \theta)^{R + S - (r + s)}. \tag{6.56}$$

Note that r and s are known if, and only if, r and n are known.

Here S is the unknown parameter of interest, since N = R + S and R is known, and $\theta$ is a nuisance
parameter. Let us assume that the marginal prior distribution of S is discrete uniform over the
non-negative integers and that S and $\theta$ are a-priori independent of each other. Let us further assume that
the prior distribution of $\theta$ is Beta($r_0$, $R_0 - r_0$), where $0 < \theta < 1$ and $0 < r_0 < R_0$. On using the Bayes
theorem, and writing $r_1 = r_0 + r$ and $R_1 = R_0 + R$, the posterior distribution of S and $\theta$ is

$$g(S, \theta\,|\,r_1, R_1, s) \propto \binom{S}{s}\theta^{r + s}(1 - \theta)^{R + S - (r + s)}\,\theta^{r_0 - 1}(1 - \theta)^{R_0 - r_0 - 1}$$

$$= \left(\theta^{r_1 - 2}(1 - \theta)^{R_1 - r_1 - 1}\right)\left(\binom{S}{s}\theta^{s + 1}(1 - \theta)^{S - s}\right), \quad \theta \in [0, 1],\ S = s, s + 1, \ldots \tag{6.57}$$

Hence

$$g(S, \theta\,|\,r_1, R_1, s) = g_1(\theta\,|\,r_1, R_1)\,g_2(S + 1\,|\,\theta, s),$$

where the marginal posterior probability density function $g_1(\theta\,|\,r_1, R_1)$ is Beta($r_1 - 1$, $R_1 - r_1$). The
conditional posterior distribution $g_2(S + 1\,|\,\theta, s)$ is the Pascal pmf with parameters $\theta$ and (s + 1), which is the
probability that (S + 1) Bernoulli trials are required to obtain (s + 1) successes with probability $\theta$ of
success in each trial.
The marginal posterior mass function of S + 1 is

$$g(S + 1\,|\,r_1, R_1, s) = \int_0^1 g(S, \theta\,|\,r_1, R_1, s)\,d\theta \propto \binom{S}{s}\int_0^1\theta^{r_1 + s - 1}(1 - \theta)^{R_1 + S - (r_1 + s) - 1}\,d\theta.$$

Thus

$$g(S + 1\,|\,r_1, R_1, s) = \binom{S}{s}\frac{B(r_1 + s, R_1 + S - (r_1 + s))}{B(r_1 - 1, R_1 - r_1)}, \quad S = s, s + 1, \ldots,\ s \ge 0;\ 1 < r_1 < R_1,$$

which is the pmf of the Beta-Pascal distribution.
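Since

$$E(S + 1\,|\,r_1, R_1, s) = \sum_{S = s}^{\infty}(S + 1)\,g(S + 1\,|\,r_1, R_1, s)$$

$$= \frac{(s + 1)}{B(r_1 - 1, R_1 - r_1)}\sum_{S = s}^{\infty}\binom{S + 1}{s + 1}\int_0^1\theta^{r_1 + s - 1}(1 - \theta)^{R_1 + S - (r_1 + s) - 1}\,d\theta$$

$$= \frac{(s + 1)}{B(r_1 - 1, R_1 - r_1)}\int_0^1\theta^{r_1 - 3}(1 - \theta)^{R_1 - r_1 - 1}\left[\sum_{S = s}^{\infty}\binom{S + 1}{s + 1}\theta^{s + 2}(1 - \theta)^{S - s}\right]d\theta$$

$$= (s + 1)\frac{B(r_1 - 2, R_1 - r_1)}{B(r_1 - 1, R_1 - r_1)} = (s + 1)\frac{R_1 - 2}{r_1 - 2},$$

since $\sum_{S = s}^{\infty}\binom{S + 1}{s + 1}\theta^{s + 2}(1 - \theta)^{S - s} = 1$,
we have, under SELF,

$$\hat{N} = E(N\,|\,r_1, R_1, s) = E(S + 1\,|\,r_1, R_1, s) + R - 1 = (s + 1)\frac{R_1 - 2}{r_1 - 2} + R - 1.$$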


Remark 6.49. Alternatively, we can obtain E(S + 1) using iterated expectations, that is,

$$E(S + 1) = E\,E(S + 1\,|\,\theta),$$

where the first expectation on the RHS is with respect to the beta distribution and the second is with
respect to the Pascal distribution (for ease of notation, let us suppress the conditioning on $r_1$, $R_1$, s in
the expectation).

Since $E(S + 1\,|\,\theta) = \frac{s + 1}{\theta}$ and

$$E\left(\frac{s + 1}{\theta}\right) = (s + 1)\frac{\int_0^1\theta^{r_1 - 3}(1 - \theta)^{R_1 - r_1 - 1}\,d\theta}{B(r_1 - 1, R_1 - r_1)} = (s + 1)\frac{(R_1 - 2)}{(r_1 - 2)},$$

therefore $E(S + 1) = (s + 1)\frac{(R_1 - 2)}{(r_1 - 2)}$.

Example 6.44. Roberts (1967) considers estimation of the number of fish in a pond by the capture-recapture
method. For R = 41, r = 8, s = 24, $R_0 = 25$, $r_0 = 2$ (so that $R_1 = 66$ and $r_1 = 10$), the Bayes estimate of N, under squared error loss
function, is

$$\hat{N} = 41 + 25\left(\frac{64}{8}\right) - 1 = 240.$$

However, the Bayes estimate of N is 186, 194, 202 when the prior distribution of $\theta$ is Bayes-Laplace
uniform, Jeffreys, and Haldane's nil-prior, respectively.
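
The figures quoted in Example 6.44 can be reproduced with a few lines of code. The sketch below assumes the updating rule $r_1 = r_0 + r$, $R_1 = R_0 + R$ for a Beta($r_0$, $R_0 - r_0$) prior, which is the reading consistent with the numbers in the example; the function name `bayes_population_estimate` is hypothetical.

```python
def bayes_population_estimate(R, r, s, r0, R0):
    """Posterior mean of N under SELF: (s + 1)(R1 - 2)/(r1 - 2) + R - 1.

    Assumes a Beta(r0, R0 - r0) prior for the capture probability, updated to
    r1 = r0 + r and R1 = R0 + R. The mean exists only when r1 > 2.
    """
    r1 = r0 + r
    R1 = R0 + R
    if r1 <= 2:
        raise ValueError("posterior mean of N requires r1 > 2")
    return (s + 1) * (R1 - 2) / (r1 - 2) + R - 1


R, r, s = 41, 8, 24                      # Roberts (1967) data
priors = {
    "Beta(2, 23), r0 = 2, R0 = 25": (2, 25),
    "Bayes-Laplace uniform": (1, 2),
    "Jeffreys": (0.5, 1),
    "Haldane nil-prior": (0, 0),
}
for name, (r0, R0) in priors.items():
    print(name, round(bayes_population_estimate(R, r, s, r0, R0), 1))
```

The four values are 240.0, 186.4, 193.8 and 202.5, in line with the estimates reported in the example.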


Remark 6.50. The classical maximum likelihood estimate $\hat{N}$ of N is R + R(s/r), which is 164. Bansal
and Ganji (1997) observed that the posterior mode with Haldane's nil-prior is 161, which happens to
be approximately equal to the classical maximum likelihood estimate. Chapman (1951) reported that
an approximately unbiased estimate $\hat{N}_c$ of N is $\frac{(R + 1)(n + 1)}{(r + 1)} - 1$, which works out to be 153. It is
interesting to note that the Bayes estimate of N, based on the posterior mode with Bayes-Laplace uniform
prior for $\theta$, is also 153.

The above example suggests that the Bayes estimate of N depends not only on the choice of
the prior distribution but also on the choice of the loss function. We may recall that in a decision
problem the prior and the loss function have a duality between them.
Example 6.45. (Lehmann, 1983, Page 92) Suppose that a lake contains an unknown number N of some
species of fish. A random sample of size R is caught, tagged, and released again. Somewhat later, a
random sample of size n is obtained and the number x of tagged fish in the sample is noted. If, for
the sake of simplicity, we assume that each caught fish is immediately returned to the lake (or
alternatively that N is very large compared to n), then the n fish in the sample constitute n Bernoulli trials
with probability $\theta = R/N$ of success, i.e., obtaining a tagged fish. The population size N is, therefore,
equal to $R/\theta$.

Note that reliable estimation of $1/\theta$ is difficult when $\theta$ is close to zero, where a small change in
$\theta$ will cause a large change in $1/\theta$. Let us use Zellner's MELO approach to obtain the estimate of $1/\theta$
and hence obtain an estimate of N.

Consider X ~ Bin(n, $\theta$) and take the prior distribution of $\theta$ to be Beta($\alpha$, $\beta$). Since the posterior
distribution of $\theta$, given x, is Beta($\alpha + x$, $\beta + n - x$), the MELO estimate of $1/\theta$ is

$$\frac{1}{E(\theta\,|\,x)}\left(1 + \frac{Var(\theta\,|\,x)}{(E(\theta\,|\,x))^2}\right)^{-1} = \frac{\alpha + \beta + n}{\alpha + x}\left[1 + \frac{\beta + n - x}{(\alpha + x)(\alpha + \beta + n + 1)}\right]^{-1}.$$

Hence, the estimate of the population size N is

$$\hat{N}_{MELO} = R\,\frac{\alpha + \beta + n}{\alpha + x}\left[1 + \frac{\beta + n - x}{(\alpha + x)(\alpha + \beta + n + 1)}\right]^{-1}.$$

In particular, if we take Haldane's nil-prior for $\theta$ ($\alpha = \beta = 0$), $\hat{N}_{MELO}$ simplifies to $\frac{Rn}{x}\left[1 + \frac{n - x}{x(n + 1)}\right]^{-1}$.
For the Roberts (1967) example, where R = 41, n = 32, x = 8,

$$\hat{N}_{MELO} = \frac{41 \times 32}{8}\left[1 + \frac{32 - 8}{8 \times 33}\right]^{-1} \approx 150.3.$$

However, under SELF

$$\hat{N}_{SELF} = R\,E\left(\frac{1}{\theta}\,\Big|\,x\right) = R\left(\frac{\alpha + \beta + n - 1}{\alpha + x - 1}\right) = R(n - 1)/(x - 1); \quad x > 1 \text{ and } \alpha = \beta = 0,$$

which gives $\hat{N}_{SELF} \approx 182$.
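
The two estimates of N can be computed side by side from the Beta posterior moments. The following is a minimal sketch; the function names `n_melo` and `n_self` are hypothetical, and Haldane's nil-prior is taken as the default only because it is the case worked above.

```python
def n_melo(R, n, x, alpha=0.0, beta=0.0):
    """MELO estimate of N = R / theta from a Beta(alpha + x, beta + n - x) posterior."""
    a_post = alpha + x
    b_post = beta + n - x
    mean = a_post / (a_post + b_post)
    var = a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
    # R * E(theta|x) / E(theta^2|x), the same quantity as R/mean * (1 + var/mean^2)^(-1)
    return R * mean / (var + mean ** 2)


def n_self(R, n, x, alpha=0.0, beta=0.0):
    """SELF estimate R * E(1/theta | x); requires alpha + x > 1."""
    return R * (alpha + beta + n - 1) / (alpha + x - 1)


R, n, x = 41, 32, 8                      # Roberts (1967) data
print(round(n_melo(R, n, x), 1))         # MELO estimate
print(round(n_self(R, n, x), 1))         # SELF estimate
print(R * n / x)                         # classical estimate R / (x/n)
```

The MELO value falls below the classical estimate R n / x = 164, while the SELF value lies above it, which illustrates how strongly the choice of loss function affects the estimate of N.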
Remark 6.51. Lehmann (1983, page 92) suggests an alternative method. In order to control the variation
of $1/\theta$, for all $\theta$, sometimes it is necessary to take more observations when the value of $\theta$ is small.
An inverse sampling scheme achieves this. Under this scheme, sampling is continued until a specified
number of successes, say m, has been obtained. Let y + m be the required number of trials. The
random variable Y has the negative binomial distribution

$$P(Y = y) = \binom{m + y - 1}{m - 1}\theta^m(1 - \theta)^y; \quad y = 0, 1, \ldots$$

If we suppose that $\theta$ has a Beta($\alpha$, $\beta$) prior, then the posterior $g(\theta|y)$ is a Beta($\alpha + m$, $\beta + y$) distribution.
Under the SELF, the Bayes estimate of $1/\theta$ is

$$E\left(\frac{1}{\theta}\,\Big|\,y\right) = \frac{B(\alpha + m - 1, \beta + y)}{B(\alpha + m, \beta + y)} = \frac{\alpha + \beta + m + y - 1}{\alpha + m - 1}.$$

In particular, for Haldane's prior for $\theta$,

$$E\left(\frac{1}{\theta}\,\Big|\,y\right) = \frac{m + y - 1}{m - 1},$$

and under the Bayes-Laplace uniform prior,

$$E\left(\frac{1}{\theta}\,\Big|\,y\right) = \frac{m + y + 1}{m}.$$

If we use Zellner's MELO approach, the Bayes estimate of $1/\theta$ is

$$\frac{\alpha + \beta + m + y}{\alpha + m}\left[1 + \frac{\beta + y}{(\alpha + \beta + m + y + 1)(\alpha + m)}\right]^{-1}.$$
Determination of sample size


Let us define $C(\theta, x)$ as the cost of observing the value of x when $\theta$ is the value of the parameter.
If g is the prior pdf (or pmf) of $\theta$ then the expected cost of observation is

$$E[C(\theta, x)] = \int\!\!\int C(\theta, x)\,f(x\,|\,\theta)\,g(\theta)\,dx\,d\theta. \tag{6.58}$$

The total risk of observing x and using a decision function $\delta$ may be defined as the sum of the risk
$r(g, \delta)$ and the expected cost of observation $E[C(\theta, x)]$. In many problems, however, the cost of observation
depends only on the size of the sample. In such situations, the statistician can select the size
of a random sample in advance by minimizing the total risk.

Example 6.46. Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from $N(\theta, r)$, where the precision r is
specified. If the prior distribution of $\theta$ is $N(\mu, \tau)$, then the posterior distribution of $\theta$ is
$N\left(\frac{\tau\mu + nr\bar{x}}{\tau + nr}, \tau + nr\right)$. Suppose we wish to determine the sample size n which minimises the total risk
under the absolute error loss $L(\theta, a) = |\theta - a|$. The Bayes estimate of $\theta$ is the posterior median, which
is equal to the posterior mean for the normal case. Hence, the posterior expected loss is

$$E\left[\left|\theta - \frac{\tau\mu + nr\bar{x}}{\tau + nr}\right|\ \Big|\ x\right] = \left(\frac{2}{\pi(\tau + nr)}\right)^{1/2},$$

since $E|Z| = \left(\frac{2}{\pi p}\right)^{1/2}$ for $Z \sim N(0, p)$, and the posterior distribution of $\left[\theta - \frac{\tau\mu + nr\bar{x}}{\tau + nr}\right]$ is $N(0, \tau + nr)$.

Hence, the Bayes risk with respect to the $N(\mu, \tau)$ prior is $\left(\frac{2}{\pi(\tau + nr)}\right)^{1/2}$ (since the posterior
expected loss happens to be independent of x). The total risk is, therefore,

$$T(n) = \left(\frac{2}{\pi(\tau + nr)}\right)^{1/2} + cn,$$

where c is the cost per observation. On differentiating T(n) with respect to n and equating it to zero,
the total risk is found to be minimized for

$$\hat{n} = \left(\frac{1}{2\pi rc^2}\right)^{1/3} - \frac{\tau}{r}.$$

In particular, when the prior information about $\theta$ is vague, we obtain $\hat{n} = (2\pi rc^2)^{-1/3}$, by letting $\tau \to 0$.
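
The optimal sample size in Example 6.46 can be checked against a direct search over integer sample sizes. This is a minimal sketch under assumed parameter values (the values of r, $\tau$ and c and the function names are hypothetical); since T(n) is convex in n, the best integer must be one of the two integers adjacent to the closed-form solution.

```python
import math


def total_risk(n, r, tau, c):
    """T(n) = sqrt(2 / (pi * (tau + n r))) + c n for absolute error loss."""
    return math.sqrt(2.0 / (math.pi * (tau + n * r))) + c * n


def optimal_n(r, tau, c):
    """Stationary point of T(n): (1 / (2 pi r c^2))^(1/3) - tau / r."""
    return (1.0 / (2.0 * math.pi * r * c ** 2)) ** (1.0 / 3.0) - tau / r


r, tau, c = 1.0, 0.5, 0.001              # hypothetical precision, prior precision, unit cost
n_star = optimal_n(r, tau, c)

# Direct search over integer sample sizes for comparison.
best_integer_n = min(range(0, 500), key=lambda n: total_risk(n, r, tau, c))
print(n_star, best_integer_n)
```

In practice the continuous solution would be rounded to whichever neighbouring integer gives the smaller total risk, and taken as zero if the formula returns a negative value.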
Example 6.47. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from Pois($\theta$) and the prior distribution of $\theta$
is Gamma($\alpha$, $\beta$), $\alpha, \beta > 0$. If the loss function is $L(\theta, a) = (\theta - a)^2/\theta$, then the Bayes estimate of $\theta$ is

$$\left[E(\theta^{-1}\,|\,x)\right]^{-1} = \frac{\alpha + \sum x_i - 1}{\beta + n}$$

and the corresponding posterior expected loss is $\frac{1}{\beta + n}$.

Hence the Bayes risk with respect to the gamma prior is $r(g) = E(1/(\beta + n))$, where the expectation
is taken with respect to the marginal distribution of X. The total risk $T(n) = (\beta + n)^{-1} + cn$ is minimized
for

$$\hat{n} = c^{-1/2} - \beta.$$

Note that if the cost per observation is very high ($c \to \infty$), the optimum number of observations
should be taken as zero, since n cannot be negative.
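Example 6.48. Suppose that $X = (X_1, X_2, \ldots, X_n)$ is a random sample from $U(0, \theta)$ and the prior
distribution of $\theta$ is Pareto($\omega_0$, $\alpha$), $\alpha > 2$. The posterior distribution of $\theta$ is Pareto($\omega_1$, $\alpha + n$), where
$\omega_1 = \max(x_{(n)}, \omega_0)$ and $x_{(n)}$ is the largest order statistic. The Bayes estimate, under SELF, is

$$E(\theta\,|\,x) = (\alpha + n)\,\omega_1^{\alpha + n}\int_{\omega_1}^{\infty}\theta\,\theta^{-(\alpha + n + 1)}\,d\theta = \frac{(\alpha + n)\,\omega_1}{\alpha + n - 1},$$

and the posterior expected loss is

$$Var(\theta\,|\,x) = \frac{(\alpha + n)\,\omega_1^2}{(\alpha + n - 2)(\alpha + n - 1)^2}.$$

The Bayes risk of g is $r(g) = E[Var(\theta\,|\,X)] = \frac{\alpha + n}{(\alpha + n - 2)(\alpha + n - 1)^2}\,E(\omega_1^2)$, where the expectation is
taken with respect to the marginal density of X. The marginal density of X is

$$m(x) = \int\ell(\theta\,|\,x)\,g(\theta)\,d\theta = \int_{\omega_1}^{\infty}\theta^{-n}\,\alpha\,\omega_0^{\alpha}\,\theta^{-(\alpha + 1)}\,d\theta = \frac{\alpha\,\omega_0^{\alpha}\,\omega_1^{-(\alpha + n)}}{(\alpha + n)},$$

and evaluating the expectation of $\omega_1^2$ with respect to it gives

$$E(\omega_1^2) = \omega_0^2\,\frac{\alpha}{\alpha + n}\cdot\frac{\alpha - 2 + n}{\alpha - 2} = \frac{\alpha(\alpha + n - 2)}{(\alpha + n)(\alpha - 2)}\,\omega_0^2.$$

Hence,

$$r(g) = \frac{\alpha\,\omega_0^2}{\alpha - 2}\cdot\frac{1}{(\alpha + n - 1)^2}.$$

On minimizing the total risk $T(n) = r(g) + cn$, we have

$$\hat{n} = \left(\frac{2\alpha\,\omega_0^2}{c(\alpha - 2)}\right)^{1/3} - (\alpha - 1).$$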


6.9 GENERALIZED MAXIMUM LIKELIHOOD ESTIMATE 


Definition 6.11. The generalized maximum likelihood estimate (gmle) of $\theta$ is the largest mode
$\hat{\theta}$ of $g(\theta\,|\,x)$.

Example 6.49. (Berger, 1980) Suppose

$$f(x\,|\,\theta) = \begin{cases} e^{-(x - \theta)} & \text{if } x > \theta \\ 0 & \text{otherwise} \end{cases}$$

and the prior for $\theta$ is $g(\theta) = 1/\pi(1 + \theta^2)$. The posterior density is

$$g(\theta\,|\,x) = \begin{cases} \dfrac{e^{-(x - \theta)}}{m(x)\,\pi(1 + \theta^2)} & \text{if } x > \theta \\ 0 & \text{otherwise.} \end{cases}$$

Since $g(\theta|x) = 0$ if $\theta > x$, we should concentrate only on the interval $\theta \le x$. It can be
easily checked that $g(\theta\,|\,x)$ is an increasing function of $\theta$ for $\theta < x$ (since $\frac{d}{d\theta}g(\theta\,|\,x) \ge 0$). Hence
it is maximised at $\hat{\theta} = x$. Thus, the gmle of $\theta$ is x.
Example 6.50. (Example 3.12 continued) The posterior distribution for $\theta$ was

$$g(\theta\,|\,x) = \begin{cases} 10(1 - \theta), & x = 0,\ 0.4 < \theta < 0.6 \\ 10\,\theta, & x = 1,\ 0.4 < \theta < 0.6. \end{cases}$$

If x = 0 is observed then the posterior density is a decreasing function of $\theta$. Therefore, the gmle of
$\theta$ is 0.4. However, if x = 1 is observed, the posterior density is an increasing function of $\theta$. Hence
$\hat{\theta} = 0.6$ is the gmle of $\theta$.

Example 6.51. (Example 4.16 continued) A lot of 1000 items is received, of which $\theta$ are defective.
Suppose we select a random sample of 10 items from this lot and X is the number of defectives in the
sample. Then the distribution of X, given $\theta$, is hypergeometric

$$f(x\,|\,\theta) = \binom{\theta}{x}\binom{1000 - \theta}{10 - x}\Big/\binom{1000}{10}.$$

If the prior pmf for $\theta$ is Bin(1000, 0.05), then the posterior pmf for $(\theta - x)$ is Bin(990, 0.05), which, with n = 990 and p = 0.05, has its mode in
the interval ((n + 1)p - 1, (n + 1)p) = (48.55, 49.55). Since in our case (n + 1)p = 49.55 is a fraction, the mode
of $(\theta - x)$ is given by the integral part of (n + 1)p. Hence the gmle of $\theta$ is 49 + x.

Example 6.52. Suppose X ~ Pois($\theta$) and the prior for $\theta$ is Gamma($\alpha$, $\beta$). The posterior for $\theta$ is
Gamma($\alpha + x$, $\beta + 1$), which has a unique maximum at $\hat{\theta} = (\alpha + x - 1)/(\beta + 1)$, which is the gmle of $\theta$.

Remark 6.52. If we take the non-informative prior $g(\theta) \propto 1$, then $g(\theta|x) \propto \ell(\theta|x)$. Thus the classical maximum
likelihood estimator of $\theta$ will coincide with the gmle.

Remark 6.53. In case the posterior distribution is unimodal and symmetric then gmle will be same as 
the posterior mean or posterior median. 

Remark 6.54. If the loss function is

$$L(\theta, a) = \begin{cases} 1 & \text{if } |\theta - a| > \varepsilon \\ 0 & \text{otherwise} \end{cases}$$

and the posterior distribution is unimodal then the Bayes estimate of $\theta$, which minimizes the posterior
expected loss, is the posterior mode. This suggests that the gmle corresponds to the 0-1 loss function from
the Bayesian decision theoretic viewpoint.
6.10 LINEAR BAYES ESTIMATION


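Suppose that the random variable X has a pdf $f(x|\theta)$ and the prior distribution of $\theta$ is $g(\theta)$.
Consider just one observation x on X. We are interested in obtaining the Bayes estimate $\delta(x)$ of $\theta$ which
is a linear function of x under SELF. Let us write $\delta(x) = a + bx$, where a and b are chosen to minimize
the expected loss averaged over both x and $\theta$. Thus, we should minimize

$$I(a, b) = \int\!\!\int(a + bx - \theta)^2 f(x\,|\,\theta)\,g(\theta)\,d\theta\,dx. \tag{6.59}$$

On differentiating with respect to a and b, and equating the derivatives to zero, we have

$$\int\!\!\int(\hat{a} + \hat{b}x - \theta)\,f(x\,|\,\theta)\,g(\theta)\,d\theta\,dx = 0$$

and

$$\int\!\!\int x(\hat{a} + \hat{b}x - \theta)\,f(x\,|\,\theta)\,g(\theta)\,d\theta\,dx = 0,$$

or

$$\hat{a} + \hat{b}\,EE(X\,|\,\theta) = E(\theta) \tag{6.60}$$

and

$$\hat{a}\,EE(X\,|\,\theta) + \hat{b}\,EE(X^2\,|\,\theta) = E(\theta E(X\,|\,\theta)). \tag{6.61}$$

The simultaneous equations (6.60) and (6.61) may be rewritten in the matrix form as

$$\begin{pmatrix} 1 & EE(X\,|\,\theta) \\ EE(X\,|\,\theta) & EE(X^2\,|\,\theta) \end{pmatrix}\begin{pmatrix} \hat{a} \\ \hat{b} \end{pmatrix} = \begin{pmatrix} E(\theta) \\ E(\theta E(X\,|\,\theta)) \end{pmatrix}. \tag{6.62}$$

Using Cramer's rule for solving the equations, we have

$$\hat{a} = \left(E(\theta)\,EE(X^2\,|\,\theta) - EE(X\,|\,\theta)\,E(\theta E(X\,|\,\theta))\right)/D,$$

$$\hat{b} = \left(E(\theta E(X\,|\,\theta)) - E(\theta)\,EE(X\,|\,\theta)\right)/D,$$

and $D = EE(X^2\,|\,\theta) - (EE(X\,|\,\theta))^2$.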

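Example 6.53. Suppose X ~ Bin(n, $\theta$). Then
$E(X|\theta) = n\theta$ and $E(X^2|\theta) = n\theta(1 - \theta) + n^2\theta^2$.

If the prior distribution of $\theta$ is U(0, 1), then

$$\hat{a} = \frac{1}{n + 2} = \hat{b}.$$

Thus, the linear Bayes estimate of $\theta$ is
$\delta(x) = \hat{a} + \hat{b}x = (x + 1)/(n + 2)$.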
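
The coefficients in Example 6.53 follow from the Cramer's rule solution using only four moments. The sketch below evaluates them in exact rational arithmetic; the function name `linear_bayes_coefficients` and the choice n = 10 are illustrative assumptions.

```python
from fractions import Fraction


def linear_bayes_coefficients(e_theta, ee_x, ee_x2, e_theta_ex):
    """Solve (6.60)-(6.61) for (a, b) by Cramer's rule.

    e_theta    = E(theta)
    ee_x       = E E(X | theta)
    ee_x2      = E E(X^2 | theta)
    e_theta_ex = E(theta * E(X | theta))
    """
    d = ee_x2 - ee_x ** 2
    a = (e_theta * ee_x2 - ee_x * e_theta_ex) / d
    b = (e_theta_ex - e_theta * ee_x) / d
    return a, b


# X ~ Bin(n, theta) with theta ~ U(0, 1): E(theta) = 1/2, E(theta^2) = 1/3.
n = 10
e_theta = Fraction(1, 2)
e_theta2 = Fraction(1, 3)
ee_x = n * e_theta                                   # E(n theta)
ee_x2 = n * (e_theta - e_theta2) + n * n * e_theta2  # E(n theta (1 - theta) + n^2 theta^2)
e_theta_ex = n * e_theta2                            # E(theta * n theta)

a_hat, b_hat = linear_bayes_coefficients(e_theta, ee_x, ee_x2, e_theta_ex)
print(a_hat, b_hat)                                  # both equal 1/(n + 2)
```

Only the first two prior moments and the conditional mean and variance of X enter the computation, so the same function can be reused for other sampling models by supplying the corresponding moments.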
Remark 6.55. Note that U(0, 1) is the Beta(1, 1) distribution, which is a conjugate prior for $\theta$. Thus, the
Bayes estimate of $\theta$ under SELF is

$$E(\theta\,|\,x) = (x + 1)/(n + 2),$$

which is the same as the linear Bayes estimate obtained above.

In general, if the prior distribution happens to be different from the conjugate prior, we may not
expect $E(\theta|x)$ to be the same as the linear Bayes estimate.
Remark 6.56. The linear Bayes estimate is the closest approximation to $E(\theta|x)$ in the class of linear
estimates.


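Example 6.54. Suppose X ~ Pois($\theta$) and $\theta$ ~ Lognormal($\mu$, $\sigma^2$). Since $E(\theta) = e^{\mu + \sigma^2/2}$, $Var(\theta) = e^{2\mu + \sigma^2}(e^{\sigma^2} - 1)$,
and

$$\hat{a} = (E(\theta))^2/D,$$

$$\hat{b} = Var(\theta)/D,$$

and $D = E(\theta) + Var(\theta)$,
the linear Bayes estimate of $\theta$ is

$$\delta(x) = \frac{1 + (e^{\sigma^2} - 1)x}{e^{-\mu - \sigma^2/2} + e^{\sigma^2} - 1}.$$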

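Remark 6.57. The posterior distribution of $\theta$ in Example 6.54 is

$$g(\theta\,|\,x) = \frac{e^{-\theta}\theta^{x - 1}\exp[-(\log\theta - \mu)^2/2\sigma^2]}{\int_0^{\infty}e^{-\theta}\theta^{x - 1}\exp[-(\log\theta - \mu)^2/2\sigma^2]\,d\theta}$$

with

$$E(\theta\,|\,x) = \frac{\int_0^{\infty}e^{-\theta}\theta^{x}\exp[-(\log\theta - \mu)^2/2\sigma^2]\,d\theta}{\int_0^{\infty}e^{-\theta}\theta^{x - 1}\exp[-(\log\theta - \mu)^2/2\sigma^2]\,d\theta}.$$

It must be evaluated numerically. However, using Lindley's approximation (see Section 10.2),

$$E(\theta\,|\,x) \approx x - \frac{\log x - \mu}{\sigma^2},$$

which is not linear in x.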
Remark 6.58. In the derivation of the estimates $\hat a$ and $\hat b$, we note that they depend only on
$E(\theta)$, $E(X\mid\theta)$, $E(\theta\,E(X\mid\theta))$ and $\mathrm{Var}(X\mid\theta)$. Thus one requires neither a full specification of the prior
distribution nor a full specification of $f(x\mid\theta)$.
Example 6.55. Linear Bayes estimate under linex loss function (Ganji, 1996)

Suppose a single observation $x$ is drawn from $f(x\mid\theta)$, where $\theta$ has a prior distribution $g(\theta)$. The
linear Bayes estimate under the linex loss function

\[ L(\theta, \delta) = \exp(c(\delta - \theta)) - c(\delta - \theta) - 1, \qquad c \neq 0, \]

is given by minimizing

\[ I(a,b) = \iint \bigl(\exp(c(a + bx - \theta)) - c(a + bx - \theta) - 1\bigr)\,f(x\mid\theta)\,g(\theta)\,d\theta\,dx. \]

For convenience, let us write

\[ \delta(x) = (a + bx)/c. \]

Thus,

\[ I(a,b) = \iint \bigl(\exp(a + bx - c\theta) - (a + bx - c\theta) - 1\bigr)\,f(x\mid\theta)\,g(\theta)\,d\theta\,dx. \]

On partially differentiating with respect to $a$ and $b$ and equating the derivatives to zero, we have

\[ e^{a}\,E\bigl(e^{-c\theta}\,M_{X\mid\theta}(b)\bigr) = 1 \]

and

\[ e^{a}\,E\Bigl(e^{-c\theta}\,\frac{\partial}{\partial b} M_{X\mid\theta}(b)\Bigr) = E\,E(X\mid\theta), \]

where $M_{X\mid\theta}(b) = E(e^{bX}\mid\theta)$ is the moment generating function of $X$ given $\theta$.
These equations must be solved to obtain $\delta(x)$.
To illustrate, let us consider $X\mid\theta \sim N(\theta, 1)$ and $\theta \sim N(0, 1)$.

Since

\[ M_{X\mid\theta}(b) = \exp(\theta b + b^2/2), \]

\[ \frac{\partial}{\partial b} M_{X\mid\theta}(b) = (\theta + b)\exp(\theta b + b^2/2) \]

and

\[ E\,E(X\mid\theta) = E(\theta) = 0, \]

we have

\[ -2a = b^2 + (b - c)^2 \]

and

\[ (2b - c)\exp((c - b)^2/2) = 0. \]

Thus,
$a = -c^2/4$ and $b = c/2$.
The linear Bayes estimate of $\theta$, under linex loss, is

\[ \delta(x) = \Bigl(-\frac{c^2}{4} + \frac{c}{2}\,x\Bigr)\Big/ c = \frac{2x - c}{4}. \]

Remark 6.59. Recall that the Bayes estimate under the linex loss function is also $(2x - c)/4$, which is a linear
function of $x$.

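Example 6.56. (O'Hagan and Forster, 2004) Let $X \sim N(\theta, r)$ where the precision $r$ is known. The Bayes
estimate $E(\theta\omega(\theta)\mid x)/E(\omega(\theta)\mid x)$ of $\theta$, under weighted SELF, is linear in $x$ if, and only if, the prior
distribution has the form $g_1(\theta) \propto n(\theta)\omega(\theta)$, where $n(\theta)$ is a normal density function and $\omega(\theta)$ is a non-negative
weight function.
Proof. The posterior distribution of $\theta$, given $x$, under the prior $g_1(\theta) \propto n(\theta)\omega(\theta)$ is

\[ g_1(\theta\mid x) = \frac{\ell(\theta\mid x)\,n(\theta)\,\omega(\theta)}{\int_{-\infty}^{\infty} \ell(\theta\mid x)\,n(\theta)\,\omega(\theta)\,d\theta}, \]

where $\ell(\theta\mid x)$ denotes the likelihood, with posterior mean

\[ E_1(\theta\mid x) = \frac{\int_{-\infty}^{\infty} \theta\,\omega(\theta)\,\ell(\theta\mid x)\,n(\theta)\,d\theta}{\int_{-\infty}^{\infty} \omega(\theta)\,\ell(\theta\mid x)\,n(\theta)\,d\theta} = \frac{\int_{-\infty}^{\infty} \theta\,\omega(\theta)\,g(\theta\mid x)\,d\theta}{\int_{-\infty}^{\infty} \omega(\theta)\,g(\theta\mid x)\,d\theta} = \frac{E(\theta\omega(\theta)\mid x)}{E(\omega(\theta)\mid x)}, \]

where the expectations in the ratio are taken with respect to the posterior distribution $g(\theta\mid x)$ when the
prior is the normal density $n(\theta)$.

Recall the duality between the loss function and the prior distribution, i.e., if $g_1(\theta) \propto n(\theta)\omega(\theta)$ is the prior
and the loss function is $L(\theta, a)$, then the Bayes estimate of $\theta$ remains the same when the prior is $n(\theta)$ and
the loss function is $\omega(\theta)L(\theta, a)$. Since the conjugate prior distribution for the exponential family of
distributions is characterised by the linearity in $x$ of the posterior mean (Diaconis and Ylvisaker (1979)),
the linearity in $x$ of $E(\theta\omega(\theta)\mid x)/E(\omega(\theta)\mid x)$ follows from the duality between the prior and the loss
function.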


6.11 CREDIBLE INTERVALS 


Suppose that $g(\theta\mid x)$ is the posterior distribution and we are interested in computing the probability
that the parameter $\theta$ lies in an interval $(a, b)$. This is given by

\[ P(\theta \in (a,b)\mid x) = \begin{cases} \displaystyle\int_a^b g(\theta\mid x)\,d\theta & \text{if } \theta \text{ is continuous,} \\[1ex] \displaystyle\sum_{a<\theta<b} g(\theta\mid x) & \text{if } \theta \text{ is discrete.} \end{cases} \tag{6.63} \]

This probability is a measure of the degree of belief that $\theta \in (a, b)$ given the sample and the prior
information. We may like to consider the inverse problem, i.e., we may like to find an interval $(a, b)$ such
that $P(\theta \in (a,b)\mid x)$ is $\alpha$. This interval may not be unique. However, if the posterior pdf is unimodal
then it is possible to obtain a unique interval by imposing the conditions that its probability content
be $\alpha$, say $\alpha = 0.95$, and that the values of $g(\theta\mid x)$ over the interval be not less than those over any other
region with the same probability content.

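Result 6.11. Suppose that $g(\theta\mid x)$ is a unimodal posterior distribution of $\theta$. The shortest interval $(a, b)$
for which $\int_a^b g(\theta\mid x)\,d\theta = \alpha$ is obtained when $g(a\mid x) = g(b\mid x)$.

Proof. The mathematical formulation of the problem is to minimize $(b - a)$ subject to $\int_a^b g(\theta\mid x)\,d\theta = \alpha$.
Let us use the method of Lagrange multipliers. Consider the function

\[ H = (b - a) + \lambda\Bigl[\int_a^b g(\theta\mid x)\,d\theta - \alpha\Bigr], \]

where $\lambda$ is a Lagrange multiplier. On partially differentiating with respect to $a$ and $b$, we have

\[ \frac{\partial H}{\partial a} = -1 + \lambda\Bigl[\int_a^b \frac{\partial g(\theta\mid x)}{\partial a}\,d\theta + g(b\mid x)\frac{\partial b}{\partial a} - g(a\mid x)\frac{\partial a}{\partial a}\Bigr] = -1 - \lambda\,g(a\mid x), \]

and

\[ \frac{\partial H}{\partial b} = 1 + \lambda\Bigl[\int_a^b \frac{\partial g(\theta\mid x)}{\partial b}\,d\theta + g(b\mid x)\frac{\partial b}{\partial b} - g(a\mid x)\frac{\partial a}{\partial b}\Bigr] = 1 + \lambda\,g(b\mid x). \]

Setting $\partial H/\partial a = \partial H/\partial b = 0$, both densities equal $-1/\lambda$, so we have $g(a\mid x) = g(b\mid x)$.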
Definition 6.12. A $100(1-\alpha)\%$ credible interval for $\theta$ is an interval $(a, b)$ such that
$P(\theta \in (a, b)\mid x) = 1 - \alpha$.
We often use the notion of highest posterior density (HPD) to determine an appropriate credible
interval. Such an interval requires

(i) $P(\theta \in (a,b)\mid x) = 1 - \alpha$, and
(ii) the posterior density over $a < \theta < b$ is greater than that over any other interval for which (i)
holds, i.e., if $(a, b)$ is an HPD credible interval then for any $\theta_1 \in (a,b)$ and any $\theta_2 \notin (a,b)$,
$g(\theta_1\mid x) \ge g(\theta_2\mid x)$,
and conversely.

In case the posterior distribution is asymmetric, we may construct the HPD credible interval $(a, a+h)$,
$h > 0$, such that $h$ is as small as possible, subject to the condition

\[ P(\theta \in (a, a+h)\mid x) = 1 - \alpha. \]

The condition derived in Result 6.11 is now

\[ g(a+h\mid x) = g(a\mid x). \]

Definition 6.13. The $100(1-\alpha)\%$ HPD credible interval of $\theta$ is an interval $(a, b)$ of the form

\[ C = \{\theta \in \Theta \ \text{s.t.}\ g(\theta\mid x) \ge k(\alpha)\}, \]

where $k(\alpha)$ is the largest constant such that $P(\theta \in C\mid x) \ge 1 - \alpha$.


Remark 6.60. In particular, if $g(\theta\mid x)$ is unimodal and symmetric about zero then $a = -b$.
Example 6.57. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from $N(\theta, 1)$; then the classical 95% confidence
interval for $\theta$ is

\[ \Bigl(\bar x - \frac{1.96}{\sqrt n},\ \bar x + \frac{1.96}{\sqrt n}\Bigr), \]

having a length $2(1.96/\sqrt n)$.

In order to construct the Bayesian 95% HPD credible interval, we need to specify the prior
distribution of $\theta$. Suppose the prior distribution of $\theta$ is non-informative, $g(\theta) \propto 1$. Since the posterior
distribution of $\theta$, given $\bar x$, is $N(\bar x, 1/n)$, the 95% HPD credible interval will be

\[ \Bigl(\bar x - \frac{1.96}{\sqrt n},\ \bar x + \frac{1.96}{\sqrt n}\Bigr), \]

having the length $2(1.96/\sqrt n)$.
It is interesting to note that, for given observed values of the sample, the two intervals are
numerically the same and have the same length, but have quite different interpretations. There has
been a lot of debate about such an agreement and about whether, in cases of prior ignorance, the classical
approach or the Bayesian approach is the more acceptable.
However, if we decide to choose the $N(0,1)$ prior for $\theta$ then the posterior distribution of $\theta$ is
$N\bigl(\frac{n\bar x}{n+1}, \frac{1}{n+1}\bigr)$ and the 95% HPD credible interval is

\[ \Bigl(\frac{n\bar x}{n+1} - \frac{1.96}{\sqrt{n+1}},\ \frac{n\bar x}{n+1} + \frac{1.96}{\sqrt{n+1}}\Bigr), \]

having a length $2(1.96/\sqrt{n+1})$. Thus, the length of the HPD credible interval is reduced from $2(1.96/\sqrt n)$
to $2(1.96/\sqrt{n+1})$, which may be significant when $n$ is small. This reduction in length is due to a more
informative prior.
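
The two HPD intervals of the normal-mean example can be compared numerically. The sketch below is illustrative only: the function name `hpd_interval_normal_mean` and the prior labels `"flat"` and `"n01"` are hypothetical, and the 1.96 multiplier is obtained from the standard normal quantile function in Python's `statistics` module.

```python
from math import sqrt
from statistics import NormalDist

def hpd_interval_normal_mean(x_bar, n, prior="flat", level=0.95):
    """HPD interval for theta when X_i ~ N(theta, 1), for two choices of prior."""
    z = NormalDist().inv_cdf(0.5 + level / 2)      # about 1.96 for level = 0.95
    if prior == "flat":        # g(theta) proportional to 1: posterior N(x_bar, 1/n)
        centre, sd = x_bar, 1 / sqrt(n)
    elif prior == "n01":       # theta ~ N(0, 1): posterior N(n x_bar/(n+1), 1/(n+1))
        centre, sd = n * x_bar / (n + 1), 1 / sqrt(n + 1)
    else:
        raise ValueError("prior must be 'flat' or 'n01'")
    return centre - z * sd, centre + z * sd

flat = hpd_interval_normal_mean(1.0, 4, prior="flat")
informative = hpd_interval_normal_mean(1.0, 4, prior="n01")
print(flat[1] - flat[0], informative[1] - informative[0])
```

For a small sample such as n = 4 the informative prior shortens the interval noticeably and also pulls its centre towards the prior mean 0.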


Remark 6.61. As in the classical case, we may not be able to achieve $P(\theta \in (a,b)\mid x) = 1 - \alpha$ precisely
but can only ensure that $P(\theta \in (a,b)\mid x) \ge 1 - \alpha$. This happens when $\theta$ is a discrete random variable.

Remark 6.62. In the classical framework, the confidence interval is a random interval containing the
fixed unknown value of $\theta$, and the assessment of its probability of actually containing $\theta$ is in terms of
repetitions of the experimental situation. There is no way of judging whether a computed classical
confidence interval, based on the observed sample, does or does not include $\theta$. The Bayesian credible
interval, on the other hand, has a direct probability interpretation $P(\theta \in (a,b)\mid x) = 1 - \alpha$ and is
completely determined from the current observed data $x$ and the prior distribution.
Example 6.58. Suppose the posterior distribution of $\theta$ is Gamma($\alpha$, $\beta$) for $\alpha > 0$ and $\beta > 0$. This distribution
is unimodal but asymmetric and, therefore, choosing equal tails to construct the HPD credible interval
will not be correct. The 95% HPD credible interval $(a, a+h)$ may now be obtained by solving the
equation

\[ g(a + h\mid x) = g(a\mid x), \]

or

\[ (a+h)^{\alpha-1} e^{-\beta(a+h)} = a^{\alpha-1} e^{-\beta a}, \]

or

\[ \Bigl(\frac{a+h}{a}\Bigr)^{\alpha-1} = e^{\beta h}, \]

that is (for $\alpha > 1$),

\[ a = h\,\bigl(\exp(\beta h/(\alpha - 1)) - 1\bigr)^{-1}, \]

subject to the condition

\[ \int_a^{a+h} \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\theta^{\alpha-1} e^{-\beta\theta}\,d\theta = 0.95. \]

An analytical solution is generally not feasible and, therefore, one may need computer-assisted numerical
techniques to construct the HPD credible interval.
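
One possible numerical scheme for the Gamma case is sketched below; it is an illustration under stated assumptions, not the method of the text. It assumes a rate parametrization Gamma($\alpha$, $\beta$) with $\alpha > 1$ (so that the mode is interior), uses Simpson's rule for the probability content, and uses nested bisection to enforce equal density at both end points. All function names are hypothetical.

```python
import math

def gamma_pdf(t, alpha, beta):
    """Gamma(alpha, beta) density (rate beta) evaluated at t."""
    if t <= 0.0:
        return 0.0
    log_pdf = (alpha * math.log(beta) - math.lgamma(alpha)
               + (alpha - 1.0) * math.log(t) - beta * t)
    return math.exp(log_pdf)

def integrate(f, lo, hi, steps=2000):
    """Composite Simpson's rule with an even number of steps."""
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0

def matching_upper_point(a, alpha, beta):
    """Find b above the mode with g(b) = g(a); g is decreasing past the mode."""
    mode = (alpha - 1.0) / beta
    target = gamma_pdf(a, alpha, beta)
    lo, hi = mode, mode + 1.0 / beta
    while gamma_pdf(hi, alpha, beta) > target:   # widen until the root is bracketed
        hi *= 2.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if gamma_pdf(mid, alpha, beta) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gamma_hpd(alpha, beta, level=0.95):
    """HPD interval (a, a + h) for a Gamma(alpha, beta) posterior with alpha > 1."""
    pdf = lambda t: gamma_pdf(t, alpha, beta)
    lo, hi = 0.0, (alpha - 1.0) / beta           # a lies between 0 and the mode
    for _ in range(50):
        a = 0.5 * (lo + hi)
        b = matching_upper_point(a, alpha, beta)
        if integrate(pdf, a, b) > level:
            lo = a                               # too much mass: move a towards the mode
        else:
            hi = a
    a = 0.5 * (lo + hi)
    return a, matching_upper_point(a, alpha, beta)

a, b = gamma_hpd(alpha=3.0, beta=1.0)
print(round(a, 3), round(b, 3))
```

The bisection on $a$ works because the probability content of the equal-density interval decreases steadily from 1 to 0 as $a$ moves from 0 up to the mode.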

Remark 6.63. The situation becomes more complicated for the multimodal posterior distribution. The 
highest mode is determined and the HPD credible interval is constructed around it. The obtained HPD 
interval may not be unique. It may happen that the HPD region is a union of two disjoint intervals. 
Such situations occur when we consider posterior distributions obtained from mixtures of prior densities 
or in cases of notorious distributions like the Cauchy distribution. Disjoint intervals often occur when
there is clashing information. Natural conjugate priors generally mask the clashes since they yield 
unimodal posteriors. 

Remark 6.64. An approximation to HPD credible interval may be obtained through the use of the 
normal approximation of the posterior distribution. 

Remark 6.65. Posterior intervals based on non-informative priors were called credible intervals by 
Edwards, Lindman and Savage (1963) and Bayesian confidence intervals by Lindley (1965). Box and 
Tiao (1973) introduced the concept of highest posterior density (HPD) interval and called shortest 
posterior interval based on non-informative prior which is locally uniform as “Standardized HPD 
intervals.” HPD intervals are not invariant under transformations of the parameter unless the 
transformation is linear. 

Example 6.59. Suppose that a random sample $X_1, X_2, \ldots, X_n$ is drawn from $N(\theta, \sigma^2)$ with $\theta$ known.
Writing $s^2 = \sum_{i=1}^{n} (x_i - \theta)^2/n$, the posterior distribution of $\sigma^2$ is

\[ g(\sigma^2\mid x) \propto g(\sigma^2)\,(\sigma^2)^{-n/2} \exp(-ns^2/2\sigma^2), \qquad \sigma^2 > 0. \]

When we take a non-informative prior $g(\sigma^2) \propto 1/\sigma^2$, i.e., $\sigma^2 \sim ns^2\chi_n^{-2}$, the $100(1-\alpha)\%$ HPD interval is such
that

\[ P\Bigl(\frac{ns^2}{\bar\chi^2} < \sigma^2 < \frac{ns^2}{\underline{\chi}^2}\Bigr) = 1 - \alpha, \]

where $\bar\chi^2$ and $\underline{\chi}^2$ are the upper and lower points of $\chi^2_n$. These points, for various values of $\alpha$ and df $n$, are available
in Tables for making inferences about the variance of the normal distribution by D.V. Lindley, D.A. East
and P.A. Hamilton (Biometrika, 47, 1960, pages 433-438).

However, if $\theta$ is also unknown, we may define $s^2 = \sum_{i=1}^{n} (x_i - \bar x)^2/(n-1)$; then the $100(1-\alpha)\%$
HPD credible interval is given by

\[ \Bigl(\sum_{i=1}^{n} (x_i - \bar x)^2\big/\bar\chi^2,\ \sum_{i=1}^{n} (x_i - \bar x)^2\big/\underline{\chi}^2\Bigr), \]

where $\bar\chi^2$ and $\underline{\chi}^2$ are the
upper and lower points of $\chi^2$ with $(n-1)$ df.

Example 6.60. Suppose the time to failure of some device is assumed to be an exponential random variable
with parameter $\theta$ and the prior distribution for $\theta$ is Gamma($m$, $m/\theta_0$), where $m$ and $\theta_0$ are known. The
posterior distribution of $\theta$ for a given random sample of $n$ failure times $X_1, X_2, \ldots, X_n$ is
Gamma($m+n$, $n\bar x + m/\theta_0$). A $100(1-\alpha)\%$ equal tail credible interval for the mean time to failure, $1/\theta$, can be
computed by considering

\[ p = P\Bigl(\frac{1}{c_2} < \theta < \frac{1}{c_1}\,\Big|\,x\Bigr) = P\Bigl(c_1 < \frac{1}{\theta} < c_2\,\Big|\,x\Bigr) = \int_{1/c_2}^{1/c_1} \frac{a^{m+n}}{\Gamma(m+n)}\,\theta^{m+n-1} e^{-a\theta}\,d\theta, \quad \text{where } a = n\bar x + m/\theta_0. \]

Since the integrand, as a function of $2a\theta$, is the kernel of a $\chi^2$-pdf with $2(m+n)$ df (if $m$ is an integer), the area between
$\chi^2(\alpha/2)$ and $\chi^2(1-\alpha/2)$ is $1-\alpha$. (Note that we are assigning equal probabilities to the two tails.) We can,
therefore, take $2a/c_2 = \chi^2(\alpha/2)$ and $2a/c_1 = \chi^2(1-\alpha/2)$. Thus the 95% credible interval for the mean time to
failure is $(2a/\chi^2(0.975),\ 2a/\chi^2(0.025))$.


Example 6.61. Suppose a random sample of $n$ independent observations is drawn from $N(\theta, \sigma^2)$ where
both $\theta$ and $\sigma^2$ are unknown. If the joint prior distribution of $\theta$ and $\sigma$ is assumed to be non-informative
such that $g(\theta, \sigma) \propto 1/\sigma$, then the marginal posterior distribution of $\theta$ is known to have a t-distribution
with $(n-1)$ df, location parameter $\bar x$, and scale parameter $s/\sqrt n$, where $s^2 = \sum_{i=1}^{n} (x_i - \bar x)^2/(n-1)$. Since
it is a symmetric and unimodal distribution, the $100(1-\alpha)\%$ HPD credible interval for $\theta$ is

\[ \Bigl(\bar x - t_{\alpha/2}\,\frac{s}{\sqrt n},\ \bar x + t_{\alpha/2}\,\frac{s}{\sqrt n}\Bigr), \]

where $t_{\alpha/2}$ is the $(1-\alpha/2)$th quantile of the t-distribution with $(n-1)$ df.
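Example 6.62. Suppose we have two random samples of sizes $n_1$ and $n_2$ drawn from the normal
populations $N(\theta_1, \sigma^2)$ and $N(\theta_2, \sigma^2)$, where $\sigma^2$ is unknown. If $\bar x_1$ and $\bar x_2$ are the sample means and
$s_1^2$ and $s_2^2$ are the sample variances, where

\[ s_i^2 = \sum_{j=1}^{n_i} (x_{ij} - \bar x_i)^2/(n_i - 1), \quad i = 1, 2, \quad \text{and} \quad s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}, \]

then the posterior distribution of $\theta_1 - \theta_2$ can be seen to be t with $(n_1 + n_2 - 2)$ df, location
parameter $\bar x_1 - \bar x_2$, and scale parameter $s(n_1^{-1} + n_2^{-1})^{1/2}$. Therefore, the $100(1-\alpha)\%$ HPD credible interval for $\theta_1 - \theta_2$ is

\[ \bigl(\bar x_1 - \bar x_2 - t_{\alpha/2}\,s\,(n_1^{-1} + n_2^{-1})^{1/2},\ \bar x_1 - \bar x_2 + t_{\alpha/2}\,s\,(n_1^{-1} + n_2^{-1})^{1/2}\bigr). \]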


Example 6.63. It may happen that the posterior distribution is a monotone function of $\theta$. In such a
situation, the HPD credible interval will be either the left tail or the right tail, depending on whether
the posterior distribution is a decreasing or an increasing function.

Suppose we toss a rupee coin one time and let $X = 1$ if we get a head, and $X = 0$ if we get a
tail. Let the probability of getting a head be $\theta$ and the prior distribution of $\theta$ be uniform on the interval
(0.4, 0.6).

(i) Obtain the 95% HPD credible interval for $\theta$ when $X = 0$ is observed.
(ii) What will be the 95% HPD credible interval for $\theta$ when $X = 1$ is observed?
Solution. The posterior distribution for $\theta$ is

\[ g(\theta\mid x) = \frac{\theta^{x}(1-\theta)^{1-x}(1/0.2)}{\int_{0.4}^{0.6} (1/0.2)\,\theta^{x}(1-\theta)^{1-x}\,d\theta}, \qquad 0.4 < \theta < 0.6, \]

\[ = \begin{cases} \dfrac{1-\theta}{\int_{0.4}^{0.6} (1-\theta)\,d\theta} & \text{if } x = 0 \\[2ex] \dfrac{\theta}{\int_{0.4}^{0.6} \theta\,d\theta} & \text{if } x = 1 \end{cases} \]

\[ = \begin{cases} 10(1-\theta) & \text{if } x = 0 \\ 10\,\theta & \text{if } x = 1, \end{cases} \qquad 0.4 < \theta < 0.6. \]

(i) Since the posterior distribution, for $x = 0$, is a decreasing function over (0.4, 0.6), the
HPD credible interval is $(0.4, a)$, such that

\[ \int_{0.4}^{a} g(\theta\mid x = 0)\,d\theta = 0.95. \]

Since $a \approx 0.588$, the HPD interval is (0.4, 0.588).
(ii) Since the posterior distribution is an increasing function over (0.4, 0.6) when
$X = 1$, the HPD interval will be $(a, 0.6)$ such that

\[ \int_{a}^{0.6} g(\theta\mid x = 1)\,d\theta = 0.95. \]

Since $a \approx 0.412$, the HPD interval is (0.412, 0.6).
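
Both end points in the coin example have closed forms, because the posterior densities $10(1-\theta)$ and $10\theta$ integrate to quadratics. The short sketch below (function names are hypothetical) solves those quadratics so that the quoted limits can be checked.

```python
from math import sqrt

def hpd_upper_limit_tail_observed(level=0.95):
    """Solve int_{0.4}^{a} 10(1 - t) dt = level for a in (0.4, 0.6)."""
    c = (0.4 - 0.4 ** 2 / 2) + level / 10     # a - a^2/2 must equal c
    return 1 - sqrt(1 - 2 * c)

def hpd_lower_limit_head_observed(level=0.95):
    """Solve int_{a}^{0.6} 10 t dt = 5(0.36 - a^2) = level for a in (0.4, 0.6)."""
    return sqrt(0.36 - level / 5)

print(hpd_upper_limit_tail_observed(), hpd_lower_limit_head_observed())
```

The two limits are mirror images about 0.5, as expected from the symmetry of the prior and of the two posterior densities.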
Example 6.64. (Berger, 1985) Suppose the posterior distribution of the parameter $\theta$ is
$g(\theta\mid x) = C e^{\theta}(1+\theta^2)^{-1}$, $\theta \in (0, x)$, which is a monotonic function of $\theta$. The HPD credible interval is of the
form $(a, x)$ since $g(\theta\mid x)$ is an increasing function of $\theta$. However, if we transform the posterior distribution
to that of $\phi = \exp(\theta)$, i.e., $\theta = \log\phi$, we have

\[ g(\phi\mid x) = C\,(1 + (\log\phi)^2)^{-1}, \qquad 1 < \phi < e^{x}, \]

which is a decreasing function of $\phi$ in $(1, e^{x})$. Therefore, the $100(1-\alpha)\%$ HPD credible interval for $\phi$ will be
an interval of the form $(1, b)$.
Remark 6.66. We may observe a conflict in this example, wherein the HPD credible interval in the original
parameterization is the upper tail but, for a monotonic reparameterization, it is the lower tail.


Chapter 7 


Hypothesis Testing 


In many circumstances there is a problem of comparing competing hypotheses: for example, we may 
be interested in comparing the hypothesis that the proportion of defectives in a lot is less than a 
specified value and the hypothesis that it is greater than that specified value, or we may wish to 
perform the test of significance in which our interest is in deciding whether a sharp null hypothesis 
is significant or not. The Bayesian approach provides a procedure of revising prior probabilities 
associated with the hypotheses in the light of observed data. In some situations we may have an 
explicitly given loss function and, therefore, the Bayes principle, discussed in Chapter 6, may be invoked 
to reach a decision of accepting (or rejecting) a hypothesis. 

Problems involving two decisions are sometimes known as problems of testing hypotheses.
Let the decision $d_0$ denote “accept the null hypothesis $H_0$” and the decision $d_1$ denote “accept
the alternative hypothesis $H_1$”. If these two hypotheses represent mutually exclusive events then
choosing the decision $d_0$ will amount to accepting $H_0$ and rejecting $H_1$.

In this chapter, the inferential as well as the decision theoretic approaches to comparing
hypotheses are discussed.


7.1 PRIOR AND POSTERIOR ODDS


A measurement of uncertainty, known as odds (also known as betting quotient), is commonly
used in gambling. Bookmakers (or bookies) quote odds in sporting events such as horse races or
cricket matches.

Definition 7.1. Let an event $A$ have probability $P(A)$ of occurring. The odds in favour of $A$ are
$P(A)/(1 - P(A))$ and the odds against $A$ are $(1 - P(A))/P(A)$.

We may find the probability of an event if the odds are known. Let $O(A)$ denote the odds in favour of
the occurrence of an event $A$; then

\[ P(A) = \frac{O(A)}{1 + O(A)}. \tag{7.1} \]

The concept of odds is an important one in the evaluation of evidence. The phrases “odds on”
and “odds in favour of” are equivalent and are used as the reciprocal of odds against.
Example 7.1. Consider the cricket team which is “3 to 2 on” to win its match. The phrase 3 to 2 is taken
as the ratio 3/2, as this is the odds on the event $A$ that the team will win its match. Thus,
$O(A) = P(A)/(1 - P(A)) = 3/2$ and, therefore, $P(A) = 3/5$.
In comparing two hypotheses, the ratio of the probability that $H_0$ is true to the probability that
it is false is called the “odds on” the hypothesis $H_0$. We write the odds on $H_0$, given data $x$, as

\[ O(H_0\mid x) = \frac{P(H_0\mid x)}{P(H_1\mid x)}, \tag{7.2} \]

where $H_1$, the alternative hypothesis, is the complement of $H_0$. The Bayes theorem for determining
$P(H_0\mid x)$ and $P(H_1\mid x)$ gives

\[ P(H_0\mid x) = \frac{P(H_0)\,P(x\mid H_0)}{P(x)} \]

and

\[ P(H_1\mid x) = \frac{P(H_1)\,P(x\mid H_1)}{P(x)}. \]

We have

\[ O(H_0\mid x) = O(H_0)\,\frac{P(x\mid H_0)}{P(x\mid H_1)}. \tag{7.3} \]

Thus, the posterior odds (7.2) on $H_0$ is the product of the prior odds on $H_0$ and a dimensionless factor known
as the likelihood ratio.

It may be interesting to observe that the posterior odds on $H_0$ in (7.3) may be easily obtained from
(3.4) by denoting the events $A$, $A^c$, and $B$ as $H_0$, $H_1$, and the observed data $x$.
Remark 7.1. In many applications it is convenient to take the logarithm of the odds because of the
fact that we can add up terms. During the 1940s and 1950s, logarithms to the base 10 were used for
numerical convenience. Nowadays, results are sometimes expressed in terms of the natural logarithm.
Remark 7.2. Posterior evidence for a hypothesis $H$ is defined as $e(H\mid x) = 10\log_{10} O(H\mid x)$. Therefore, the
posterior evidence for $H$ is equal to the prior evidence plus the number of decibels provided by
working out the log likelihood ratio, that is,

\[ e(H\mid x) = e(H) + 10\log_{10}\frac{P(x\mid H)}{P(x\mid \bar H)}, \tag{7.4} \]

where $\bar H$ is the alternative hypothesis. The decibel is a unit used for measuring evidence.
Definition 7.2. The ratio of posterior odds to prior odds

\[ \frac{O(H_0\mid x)}{O(H_0)} = \frac{P(x\mid H_0)}{P(x\mid H_1)}, \tag{7.5} \]

is known as the Bayes factor in favour of $H_0$ and is denoted by $B_{01}$.
Thus, the Bayes factor is a ratio of conditional probabilities of the data at hand. In other words,
the Bayes factor is a ratio of likelihoods.
Remark 7.3. The Bayes factor evaluates the modification of the odds of the hypothesis
$H_0 : \theta \in \Theta_0$ against $H_1 : \theta \in \Theta_1$ due to the observation $x$ and can be compared to unity. It depends on
the prior information. However, since it partly eliminates the influence of the prior beliefs and
emphasizes the role of the observations, the Bayes factor is sometimes proposed as an ‘objective
Bayesian answer’.

Example 7.2. I have two boxes $B_1$ and $B_2$; $B_1$ has four slips and $B_2$ has twenty slips, and their slips are
labelled 1, 2, 3, 4 and 1, 2, ..., 20, respectively. Let us assume that the slips are equally likely to appear in
any draw for each of the two boxes separately. Consider the event $E$ to be that $B_1$ is selected and let
the event $F$ be that $B_2$ is selected. The chance of selecting either one of the two boxes is 1/2. Now a
slip is drawn from the chosen box and the result is 13. Then $P(E \mid \text{number 13 is observed}) = 0$ (we
do not need Bayes rule to conclude that the twenty-slip box was picked).

If the result was 2, which box was picked? The likelihood of $E$ is $P(2\mid E) = 1/4$ and the likelihood
of $F$, which is ‘not $E$’, is $P(2\mid F) = 1/20$. The Bayes theorem gives

\[ P(E\mid 2) = \frac{P(E)\,P(2\mid E)}{P(E)\,P(2\mid E) + P(F)\,P(2\mid F)} = \frac{1/4}{1/4 + 1/20} = \frac{5}{6}. \]

Note: The prior probability of choosing $B_1$, which was 1/2, is increased to 5/6 when we observe that
2 has occurred.

The Bayes factor against $E$ is

\[ \frac{P(2\mid F)}{P(2\mid E)} = \frac{1/20}{1/4} = \frac{1}{5}, \]

and the Bayes factor in favour of $E$ is 5. We may, therefore, say that the observation 2 is 5 times more
likely for the box $B_1$ than for the box $B_2$.
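
The odds form of Bayes theorem used in the two-box example can be written as three small helper functions. This is a minimal sketch with hypothetical function names; exact fractions are used so that the values 5 and 5/6 are reproduced without rounding.

```python
from fractions import Fraction

def odds(p):
    """Odds in favour of an event with probability p."""
    return p / (1 - p)

def probability(o):
    """Probability of an event whose odds in favour are o."""
    return o / (1 + o)

def posterior_odds(prior_odds, lik_h0, lik_h1):
    """Posterior odds on H0 = prior odds times the likelihood ratio (Bayes factor)."""
    return prior_odds * lik_h0 / lik_h1

# Two boxes: E = "four-slip box chosen", F = "twenty-slip box chosen", slip 2 observed.
bayes_factor = Fraction(1, 4) / Fraction(1, 20)
post = posterior_odds(odds(Fraction(1, 2)), Fraction(1, 4), Fraction(1, 20))
print(bayes_factor, post, probability(post))
```

The same `probability` helper converts betting odds such as "3 to 2 on" into a probability of 3/5.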


Remark 7.4. Laplace (1820) explicitly anticipated the concept of Bayes factor and C.S. Peirce (1878) 
came very close to the weight of evidence (logarithm of Bayes factor). The concept of Bayes factor 
is explicit in Wrinch and Jeffreys (1921). In 1936, Jeffreys called the weight of evidence ‘support’. 


Remark 7.5. Jeffreys (1961) recommended interpreting the Bayes factor in units of 1/2 on the log to
the base 10 scale. The criteria were as follows:

Table 7.1
Jeffreys' Bayes Factor Classification

log10(B10)   | B10        | Evidence against H0
0 to 1/2     | 1 to 3.2   | Not worth more than a bare mention
1/2 to 1     | 3.2 to 10  | Substantial
1 to 2       | 10 to 100  | Strong
> 2          | > 100      | Decisive

These rough categories, given in Jeffreys (1961), seem to furnish appropriate guidelines.

The labelling of these categories should not be considered as a calibration of the Bayes factor
but rather as a rough descriptive statement and standard of evidence in scientific investigation. Kass and
Raftery (1995) considered twice the natural logarithm of the Bayes factor. They rounded and used 20
rather than 10 as the requirement for strong evidence against $H_0$. The reason for using twice the natural
logarithm of the Bayes factor was that a similar scale has been used by classical statisticians for
likelihood ratio test statistics. Furthermore, their experience was that the categories in Table 7.2
furnished appropriate guidelines.

Table 7.2
Bayes Factor Classification of Kass-Raftery

2 log_e(B10) | Evidence against H0
0 to 2       | Not worth more than a bare mention
2 to 6       | Positive
6 to 10      | Strong
> 10         | Very strong

This new classification is slightly more conservative than that of Jeffreys.
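
The two classification tables translate directly into lookup functions. The sketch below is an illustration only: the function names are hypothetical, the treatment of values falling exactly on a boundary (assigned to the lower category) is an assumption, and the label returned for $B_{10} < 1$ is not part of either table.

```python
import math

def jeffreys_category(b10):
    """Label from Jeffreys' table for a Bayes factor B10 (evidence against H0)."""
    score = math.log10(b10)
    if score < 0:
        return "Favours H0"          # outside the table; label added for this sketch
    if score <= 0.5:
        return "Not worth more than a bare mention"
    if score <= 1:
        return "Substantial"
    if score <= 2:
        return "Strong"
    return "Decisive"

def kass_raftery_category(b10):
    """Label from the Kass-Raftery table, based on twice the natural log of B10."""
    score = 2 * math.log(b10)
    if score < 0:
        return "Favours H0"          # outside the table; label added for this sketch
    if score <= 2:
        return "Not worth more than a bare mention"
    if score <= 6:
        return "Positive"
    if score <= 10:
        return "Strong"
    return "Very strong"

for b10 in (2, 5, 15, 50, 500):
    print(b10, jeffreys_category(b10), "/", kass_raftery_category(b10))
```

A Bayes factor of 15, for example, is "Strong" on Jeffreys' scale but only "Positive" on the Kass-Raftery scale, which illustrates the more conservative character of the latter.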
7.2 BAYES FACTOR FOR SIMPLE VERSUS SIMPLE HYPOTHESES 


It is often convenient to summarize the evidence in terms of posterior odds. For example, the
statement “posterior odds in favour of $H_0$ is 10” conveys the conclusion that $H_0$ is 10 times as likely
to be true as $H_1$. In some situations, the Bayes factor can be interpreted as the odds for $H_0$ to $H_1$ that
are given by the data. For example, let us consider simple null and alternative hypotheses $H_0 : \theta = \theta_0$
and $H_1 : \theta = \theta_1$. Here the parameter space is $\Theta = \{\theta_0, \theta_1\}$. If the sample is drawn from a population having
pdf (or pmf) $f(x\mid\theta)$ then

\[ p_0 = P(\theta = \theta_0\mid x) = \frac{\pi_0 f(x\mid\theta_0)}{m(x)} \]

and

\[ p_1 = P(\theta = \theta_1\mid x) = \frac{\pi_1 f(x\mid\theta_1)}{m(x)}, \]

where

\[ \pi_0 = P(\theta = \theta_0), \qquad \pi_1 = P(\theta = \theta_1), \]

and

\[ m(x) = \sum_{i=0}^{1} P(\theta = \theta_i)\,f(x\mid\theta_i) = \pi_0 f(x\mid\theta_0) + \pi_1 f(x\mid\theta_1). \]

Since the posterior odds in favour of $H_0$ is

\[ O(H_0\mid x) = \frac{p_0}{p_1} = \frac{\pi_0 f(x\mid\theta_0)}{\pi_1 f(x\mid\theta_1)}, \]

the Bayes factor in favour of $H_0$ is

\[ B_{01} = \frac{p_0/p_1}{\pi_0/\pi_1} = \frac{f(x\mid\theta_0)}{f(x\mid\theta_1)}. \tag{7.6} \]

Thus, $B_{01}$ is the likelihood ratio of $H_0$ to $H_1$ which is, in general, considered as the odds for $H_0$
to $H_1$ that are given by the data.
Example 7.3. Let us assume that a coin has been fairly and independently tossed $n$ times and that we
observe $x$ heads and $(n - x)$ tails. We wish to compare $H_0 : \theta = 1/2$ against
$H_1 : \theta = 1/4$, where $\theta$ is the probability of getting a head. If we are equally ignorant about $H_0$ and $H_1$,
then the prior probabilities are $P(H_0) = P(H_1) = 1/2$. The posterior odds in favour of $H_0$ is

\[ \frac{P(H_0\mid x)}{P(H_1\mid x)} = \frac{p_0}{p_1} = \frac{P(H_0)\,f(x\mid H_0)}{P(H_1)\,f(x\mid H_1)} = \frac{(1/2)^{x}(1/2)^{n-x}}{(1/4)^{x}(3/4)^{n-x}} = \frac{2^{n}}{3^{n-x}}. \]

In particular, if the coin is tossed 5 times and we observe 3 heads then the posterior odds in
favour of $H_0$ is 32/9. This suggests that the data have changed our prior probabilities from 1/2 to
32/41 and 9/41 for $H_0$ and $H_1$, respectively.

However, if we observe 5 heads in 5 tosses, the posterior odds in favour of $H_0$ changes to 32
and, therefore, our prior probabilities change from 1/2 to 32/33 and 1/33 for $H_0$ and $H_1$, respectively,
after the data are observed. On the other hand, if we get 5 tails in 5 tosses, the posterior odds in favour
of $H_0$ is 32/243 and, therefore, our prior probabilities change from 1/2 to 32/275 and 243/275 for $H_0$ and
$H_1$, respectively.
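
The three cases of the coin example can be reproduced exactly. The following sketch (the function name `coin_posterior_odds` is hypothetical) works with rational numbers; the binomial coefficient is omitted because it cancels in the ratio of the two likelihoods.

```python
from fractions import Fraction

def coin_posterior_odds(x, n, theta0=Fraction(1, 2), theta1=Fraction(1, 4),
                        prior_odds=Fraction(1)):
    """Posterior odds on H0: theta = theta0 against H1: theta = theta1 after x heads in n tosses."""
    lik0 = theta0 ** x * (1 - theta0) ** (n - x)
    lik1 = theta1 ** x * (1 - theta1) ** (n - x)
    return prior_odds * lik0 / lik1

for heads in (3, 5, 0):
    o = coin_posterior_odds(heads, 5)
    print(heads, o, o / (1 + o))     # odds on H0 and posterior probability of H0
```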

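Example 7.4. Suppose $X \sim N(\theta, 1)$ and we are interested in testing whether $H_0 : \theta = 0$ or $H_1 : \theta = 1$.
Assuming that, a priori, $H_0$ and $H_1$ are equally likely then, based on a random sample $X_1, X_2, \ldots, X_n$,
the posterior odds in favour of $H_0$ is

\[ O(H_0\mid x) = \frac{P(H_0\mid x)}{P(H_1\mid x)} = \frac{P(H_0)\,f(x\mid\theta = 0)}{P(H_1)\,f(x\mid\theta = 1)}. \]

Since

\[ f(x_1, x_2, \ldots, x_n\mid\theta) = \Bigl(\frac{1}{2\pi}\Bigr)^{n/2} \exp\Bigl[-\frac{1}{2}\sum_{i=1}^{n} (x_i - \theta)^2\Bigr] = \Bigl(\frac{1}{2\pi}\Bigr)^{n/2} \exp\Bigl[-\frac{1}{2}\Bigl\{\sum_{i=1}^{n} (x_i - \bar x)^2 + n(\bar x - \theta)^2\Bigr\}\Bigr], \]

we have

\[ O(H_0\mid x) = \frac{0.5\,\exp\bigl[-\frac{1}{2}\bigl\{\sum_{i=1}^{n} (x_i - \bar x)^2 + n\bar x^2\bigr\}\bigr]}{0.5\,\exp\bigl[-\frac{1}{2}\bigl\{\sum_{i=1}^{n} (x_i - \bar x)^2 + n(\bar x - 1)^2\bigr\}\bigr]} = \exp\Bigl[-\frac{n}{2}(2\bar x - 1)\Bigr]. \]

In particular, if $n = 10$ and $\bar x = 0.5$, we have $O(H_0\mid x) = 1$. Thus the observed data do not
discriminate between $\theta = 0$ and $\theta = 1$. However, if the observed value of $\bar x$ is 1, then
$O(H_0\mid x) = \exp(-5) = 0.006738$. Hence, the posterior probability of $H_0$ is

\[ P(H_0\mid x) = \frac{O(H_0\mid x)}{1 + O(H_0\mid x)} = 0.0067 \]

when $\bar x = 1$ is observed for the same sample size.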
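Remark 7.6. Since the posterior distribution of $\theta$, based on $n$ iid observations from $f(x\mid\theta)$, is the same as
the posterior distribution of $\theta$ given the value of the sufficient statistic $t$, the posterior odds may be computed from the sampling distribution of $t$:

\[ O(H_0\mid t) = \frac{P(H_0\mid t)}{P(H_1\mid t)} = \frac{P(H_0)\,f(t\mid\theta = 0)}{P(H_1)\,f(t\mid\theta = 1)}. \tag{7.7} \]

Example 7.5. (Example 7.4 continued) Note that the sufficient statistic for $\theta$ is $\bar X$, having sampling
distribution $N(\theta, 1/n)$. Thus, we have

\[ O(H_0\mid \bar x) = \frac{0.5\,\bigl(\frac{n}{2\pi}\bigr)^{1/2} \exp\bigl(-\frac{n}{2}\bar x^2\bigr)}{0.5\,\bigl(\frac{n}{2\pi}\bigr)^{1/2} \exp\bigl(-\frac{n}{2}(\bar x - 1)^2\bigr)} = \exp\Bigl[-\frac{n}{2}(2\bar x - 1)\Bigr] = O(H_0\mid x). \]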
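
Because the posterior odds depend on the sample only through $\bar x$, they can be computed from the $N(\theta, 1/n)$ sampling distribution of the mean alone. The sketch below uses a hypothetical function name and defaults to the hypotheses $\theta = 0$ and $\theta = 1$ with equal prior probabilities.

```python
import math

def posterior_odds_normal_mean(x_bar, n, theta0=0.0, theta1=1.0, prior_odds=1.0):
    """Posterior odds on H0: theta = theta0 against H1: theta = theta1, using x_bar ~ N(theta, 1/n)."""
    log_likelihood_ratio = -0.5 * n * ((x_bar - theta0) ** 2 - (x_bar - theta1) ** 2)
    return prior_odds * math.exp(log_likelihood_ratio)

for x_bar in (0.5, 1.0):
    o = posterior_odds_normal_mean(x_bar, 10)
    print(x_bar, o, o / (1 + o))     # odds on H0 and posterior probability of H0
```

With the default hypotheses the log likelihood ratio reduces to $-\frac{n}{2}(2\bar x - 1)$, the expression obtained from the full sample.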


Example 7.6. (O'Hagan and Forster, 2004) Suppose we wish to compare completely specified negative
binomial and Poisson models based on $n$ iid observations $X_1, X_2, \ldots, X_n$. We can specify
$H_0 : X \sim \mathrm{NBin}(1, \theta_0)$ and $H_1 : X \sim \mathrm{Pois}(\theta_1)$. Since the two hypotheses are simple, the posterior odds in
favour of $H_0$, when the two hypotheses are a priori equally likely, is

\[ O(H_0\mid x) = \frac{\prod_{i=1}^{n} \theta_0 (1 - \theta_0)^{x_i}}{\prod_{i=1}^{n} \theta_1^{x_i} e^{-\theta_1} (x_i!)^{-1}} = \frac{\theta_0^{n} (1 - \theta_0)^{n\bar x}\,\prod_{i=1}^{n} x_i!}{\theta_1^{n\bar x}\,e^{-n\theta_1}}. \]

Let us assume that the negative binomial and Poisson distributions under consideration have
equal means, that is, $\theta_0 = (1 + \theta_1)^{-1}$. For example, let us take $\theta_1 = 2$ so that $\theta_0 = 1/3$. If we take a sample
of size 2 and observe $x_1 = x_2 = 0$ then $O(H_0\mid x_1, x_2) = e^4/9 = 6.1$. Thus, the negative binomial distribution is
more plausible than the Poisson distribution. However, if both the observations are 2 each then
$O(H_0\mid x_1, x_2) = 4e^4/729 = 0.3$, which suggests that the Poisson distribution is more plausible. If $x_1 = x_2 = 1$,
then $O(H_0\mid x_1, x_2) = e^4/81 = 0.67$, indicating a slight decrease in plausibility for the Poisson model in
comparison to the observation $x_1 = x_2 = 2$.
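
The three odds quoted for the negative binomial against Poisson comparison can be recomputed as follows. This is a sketch under the assumptions that NBin(1, $\theta_0$) has pmf $\theta_0(1-\theta_0)^x$ and that the prior odds are one; the function name is hypothetical, and the computation is done on the log scale for numerical stability.

```python
import math

def odds_nbin_vs_poisson(xs, theta0, theta1, prior_odds=1.0):
    """Posterior odds on H0: NBin(1, theta0) against H1: Pois(theta1) for iid data xs."""
    log_lik0 = sum(math.log(theta0) + x * math.log(1 - theta0) for x in xs)
    log_lik1 = sum(x * math.log(theta1) - theta1 - math.lgamma(x + 1) for x in xs)
    return prior_odds * math.exp(log_lik0 - log_lik1)

for sample in ([0, 0], [1, 1], [2, 2]):
    print(sample, round(odds_nbin_vs_poisson(sample, 1 / 3, 2.0), 3))
```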


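Remark 7.7. Let us assign prior distributions to $\theta_0$ and $\theta_1$: $g_0(\theta_0) = \mathrm{Beta}(\alpha_0, \beta_0)$ and
$g_1(\theta_1) = \mathrm{Gamma}(\alpha_1, \beta_1)$. Then

\[ f_0(x_1, x_2, \ldots, x_n\mid H_0) = \int_0^1 f_0(x_1, x_2, \ldots, x_n\mid\theta_0)\,g_0(\theta_0)\,d\theta_0 = \int_0^1 \frac{\theta_0^{n+\alpha_0-1}(1-\theta_0)^{n\bar x+\beta_0-1}}{B(\alpha_0, \beta_0)}\,d\theta_0 = \frac{B(\alpha_0 + n, \beta_0 + n\bar x)}{B(\alpha_0, \beta_0)}; \]

similarly,

\[ f_1(x_1, x_2, \ldots, x_n\mid H_1) = \int_0^\infty \frac{e^{-n\theta_1}\theta_1^{n\bar x}}{\prod x_i!}\,\frac{\beta_1^{\alpha_1}\theta_1^{\alpha_1-1}e^{-\beta_1\theta_1}}{\Gamma(\alpha_1)}\,d\theta_1 = \frac{\beta_1^{\alpha_1}}{(\prod x_i!)\,\Gamma(\alpha_1)}\,\frac{\Gamma(\alpha_1 + n\bar x)}{(\beta_1 + n)^{\alpha_1 + n\bar x}}, \]

and

\[ E(X\mid H_0) = \int_0^1 E(X\mid H_0, \theta_0)\,g_0(\theta_0)\,d\theta_0. \]

Since, under $H_0$, $X \sim \mathrm{NBin}(1, \theta_0)$, we have $E(X\mid H_0, \theta_0) = (1 - \theta_0)/\theta_0$, and

\[ E(X\mid H_0) = \int_0^1 \Bigl(\frac{1-\theta_0}{\theta_0}\Bigr)\frac{\theta_0^{\alpha_0-1}(1-\theta_0)^{\beta_0-1}}{B(\alpha_0, \beta_0)}\,d\theta_0 = \frac{B(\alpha_0 - 1, \beta_0 + 1)}{B(\alpha_0, \beta_0)} = \frac{\beta_0}{\alpha_0 - 1}, \qquad \alpha_0 > 1. \]

Under $H_1$, $X \sim \mathrm{Pois}(\theta_1)$, we have $E(X\mid H_1, \theta_1) = \theta_1$.
Therefore,

\[ E(X\mid H_1) = \int_0^\infty E(X\mid H_1, \theta_1)\,g_1(\theta_1)\,d\theta_1 = \int_0^\infty \theta_1\,\frac{\beta_1^{\alpha_1}\theta_1^{\alpha_1-1}e^{-\beta_1\theta_1}}{\Gamma(\alpha_1)}\,d\theta_1 = \Bigl(\frac{\beta_1^{\alpha_1}}{\Gamma(\alpha_1)}\Bigr)\frac{\Gamma(\alpha_1 + 1)}{\beta_1^{\alpha_1+1}} = \frac{\alpha_1}{\beta_1}. \]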

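Hence, the two predictive models have the same mean when the hyperparameters $\alpha_0$, $\beta_0$, $\alpha_1$ and $\beta_1$ satisfy
$\alpha_1(\alpha_0 - 1) = \beta_0\beta_1$. For a sample of size 2, the Bayes factor, $B_{01}$, in favour of $H_0$ is

\[ B_{01} = \frac{f_0(x_1, x_2\mid H_0)}{f_1(x_1, x_2\mid H_1)} = \frac{B(\alpha_0 + 2, \beta_0 + x_1 + x_2)}{B(\alpha_0, \beta_0)} \Bigg/ \frac{\beta_1^{\alpha_1}\,\Gamma(\alpha_1 + x_1 + x_2)}{x_1!\,x_2!\,\Gamma(\alpha_1)\,(\beta_1 + 2)^{\alpha_1 + x_1 + x_2}} \]

\[ = \begin{cases} 1.5 & \text{when } x_1 = x_2 = 0 \\ 0.29 & \text{when } x_1 = x_2 = 2 \end{cases} \]

when $\alpha_0 = \beta_1 = 1$ and $\alpha_1 = \beta_0 = 2$ (so that $E(\theta_0) = 1/3$ and $E(\theta_1) = 2$, the values used in Example 7.6).
However, for the prior specification $\alpha_0 = \beta_1 = 30$ and $\alpha_1 = \beta_0 = 60$,

\[ B_{01} = \begin{cases} 5.5 & \text{when } x_1 = x_2 = 0 \\ 0.3 & \text{when } x_1 = x_2 = 2. \end{cases} \]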

This example suggests that the Bayes factor depends on the prior distributions specified for the
parameters of each model.

7.3 BAYES FACTOR FOR COMPOSITE VERSUS COMPOSITE HYPOTHESES
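Let us consider the situation where $H_0$ and $H_1$ are both composite hypotheses. Suppose
$H_0 : \theta \in \Theta_0$ and $H_1 : \theta \in \Theta_1$ such that $\Theta_0 \cup \Theta_1 = \Theta$ and $\Theta_0 \cap \Theta_1 = \emptyset$ (null set), and $\pi_0$ and $\pi_1\,(= 1 - \pi_0)$,
as before, are the prior probabilities of $H_0$ and $H_1$, respectively, and $g_0(\theta)$ and $g_1(\theta)$ are the proper prior
pdfs defined over the sub-parameter-spaces $\Theta_0$ and $\Theta_1$. Let us define the prior density $g(\theta)$ over the
whole parameter space $\Theta$ as

\[ g(\theta) = \begin{cases} \pi_0\,g_0(\theta) & \text{if } \theta \in \Theta_0 \\ \pi_1\,g_1(\theta) & \text{if } \theta \in \Theta_1. \end{cases} \tag{7.8} \]

Clearly, $g(\theta)$ is a proper pdf over the whole parameter space $\Theta$ since

\[ \int_{\Theta} g(\theta)\,d\theta = \int_{\Theta_0} \pi_0\,g_0(\theta)\,d\theta + \int_{\Theta_1} \pi_1\,g_1(\theta)\,d\theta = \pi_0 + \pi_1 = 1. \]

The posterior probabilities of $H_0$ and $H_1$ are

\[ p_0 = P(\theta \in \Theta_0\mid x) = \int_{\theta \in \Theta_0} \frac{g(\theta)\,f(x\mid\theta)}{m(x)}\,d\theta = \frac{\pi_0}{m(x)} \int_{\theta \in \Theta_0} f(x\mid\theta)\,g_0(\theta)\,d\theta, \]

and

\[ p_1 = 1 - p_0 = 1 - \frac{\pi_0}{m(x)} \int_{\theta \in \Theta_0} f(x\mid\theta)\,g_0(\theta)\,d\theta = \frac{\pi_1}{m(x)} \int_{\theta \in \Theta_1} f(x\mid\theta)\,g_1(\theta)\,d\theta, \]

where

\[ m(x) = \int_{\Theta} g(\theta)\,f(x\mid\theta)\,d\theta = \pi_0 \int_{\theta \in \Theta_0} f(x\mid\theta)\,g_0(\theta)\,d\theta + \pi_1 \int_{\theta \in \Theta_1} f(x\mid\theta)\,g_1(\theta)\,d\theta = \pi_0\,m_0(x) + \pi_1\,m_1(x), \]

and

\[ m_i(x) = \int_{\theta \in \Theta_i} f(x\mid\theta)\,g_i(\theta)\,d\theta, \qquad i = 0, 1. \]

Therefore, the Bayes factor in favour of $H_0$ against $H_1$ is

\[ B_{01} = \frac{p_0/p_1}{\pi_0/\pi_1} = \frac{\int_{\Theta_0} f(x\mid\theta)\,g_0(\theta)\,d\theta}{\int_{\Theta_1} f(x\mid\theta)\,g_1(\theta)\,d\theta} = \frac{m_0(x)}{m_1(x)}. \]

Note that the likelihoods are now replaced by their respective marginals.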

Remark 7.8. We notice that the Bayes factor is the ratio of averaged likelihoods with the prior pdfs
$g_0(\theta)$ and $g_1(\theta)$ serving as the weighting functions of $H_0$ and $H_1$. This suggests that the Bayes factor
cannot be considered as a measure of the relative support for the hypotheses provided solely by the
data when $H_0$ and $H_1$ are composite.

Remark 7.9. If $\hat\theta_0$ and $\hat\theta_1$ are the maximum likelihood estimators of $\theta$ on $\Theta_0$ and $\Theta_1$, respectively, then
the likelihood ratio defined as

\[ \frac{\sup_{\theta \in \Theta_0} f(x\mid\theta)}{\sup_{\theta \in \Theta_1} f(x\mid\theta)} = \frac{f(x\mid\hat\theta_0)}{f(x\mid\hat\theta_1)} \tag{7.9} \]

may be considered as a particular case of a Bayes factor when $\pi_0$ and $\pi_1$ are prior probabilities
concentrated at $\hat\theta_0$ and $\hat\theta_1$ for the sub-parameter-spaces $\Theta_0$ and $\Theta_1$, respectively.


Example 7.7. Suppose an IQ test result, $X$, has a normal distribution with unknown mean $\theta$ and variance
100. If the prior distribution of $\theta$ is $N(100, 225)$ and the observed score on the test is $x = 115$ then
the posterior distribution of $\theta$ is $N(110.39, 69.23)$.

Suppose we wish to test $H_0 : \theta \le 100$ against $H_1 : \theta > 100$. Here, $\pi_0 = P(\theta \le 100) = 1/2 = \pi_1$.
Therefore, the prior odds ratio is one. However, $p_0 = P(\theta \le 100\mid x = 115) = 0.106$ and
$p_1 = P(\theta > 100\mid x = 115) = 1 - p_0 = 0.894$. Therefore, the Bayes factor in favour of $H_0$ is

\[ B_{01} = \frac{p_0}{p_1} = \frac{0.106}{0.894} \approx 0.119, \]

which is also the posterior odds in favour of $H_0$. This suggests that the null hypothesis that the IQ of
a person, having scored 115 on the IQ test, is less than 100 is much less likely than the hypothesis
that the true IQ is greater than 100.
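
The numbers in the IQ example follow from the standard normal-normal updating formulas and the normal distribution function. A minimal sketch, with hypothetical function and variable names, is given below; it uses `NormalDist` from Python's `statistics` module.

```python
from math import sqrt
from statistics import NormalDist

def normal_posterior(prior_mean, prior_var, x, sampling_var):
    """Posterior mean and variance of theta for one observation x ~ N(theta, sampling_var)."""
    post_var = 1 / (1 / prior_var + 1 / sampling_var)
    post_mean = post_var * (prior_mean / prior_var + x / sampling_var)
    return post_mean, post_var

post_mean, post_var = normal_posterior(100, 225, 115, 100)
pi0 = NormalDist(100, sqrt(225)).cdf(100)                  # prior P(theta <= 100)
p0 = NormalDist(post_mean, sqrt(post_var)).cdf(100)        # posterior P(theta <= 100)
bayes_factor_01 = (p0 / (1 - p0)) / (pi0 / (1 - pi0))
print(round(post_mean, 2), round(post_var, 2), round(p0, 3), round(bayes_factor_01, 3))
```

Because the prior odds are exactly one here, the Bayes factor coincides with the posterior odds.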
Example 7.8. (Berger, 1985) The waiting time for a bus at a given corner at a certain time of day is
known to have a $U(0, \theta)$ distribution. It is desired to test $H_0: 0 \le \theta \le 15$ versus $H_1: \theta > 15$. From other
similar routes, it is known that $\theta \sim \text{Pareto}(5, 3)$. If waiting times of 10, 3, 2, 5, 14 are observed at a given
corner, calculate the posterior probability of each hypothesis, the posterior odds, and the Bayes factor.
The posterior distribution of $\theta$ is

$$g(\theta \mid x) = \frac{8(14)^8}{\theta^9}\, I_{(14,\infty)}(\theta).$$

The parameters of the posterior distribution are $b = 3 + 5 = 8$ and $a = \max(5, M) = \max(5, 14) = 14$, where $M = \max(X_1, X_2, X_3, X_4, X_5)$. Therefore, the posterior probability of $H_0$ being true is

$$p_0 = \int_{14}^{15} \frac{8(14)^8}{\theta^9}\, d\theta = 1 - \left(\frac{14}{15}\right)^8 = 0.42,$$

and the posterior probability of $H_1$ is $p_1 = 1 - p_0 = 0.58$. Therefore, the posterior odds in favour of $H_0$
is $O(H_0 \mid x) = p_0/p_1 = 0.72$, and the Bayes factor in favour of $H_0$ is $0.72(1 - \pi_0)/\pi_0$, where $\pi_0$ is the prior
probability of $H_0$ being true. In particular, if $\pi_0 = \pi_1 = 1/2$, i.e., the two hypotheses, a-priori, are equally
likely then the Bayes factor $B_{01}$ is 0.72. Since $B_{01}$ is near one we may say that the observed waiting
times may not significantly alter beliefs about the two hypotheses.
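
A minimal sketch of the calculation in Example 7.8 follows. The helper names are illustrative. Two points are worth noting when reading the output: the unrounded posterior odds are about 0.74 (0.72 arises from dividing the rounded values 0.42 and 0.58), and the Pareto(5, 3) prior itself implies $\pi_0 = P(\theta \le 15) = 1 - (5/15)^3 = 26/27$, so the choice $\pi_0 = 1/2$ is a separate assumption.

```python
def pareto_cdf(t, scale, shape):
    """P(theta <= t) for a Pareto distribution with the given scale and shape."""
    return 0.0 if t <= scale else 1.0 - (scale / t) ** shape

def bayes_factor(posterior_odds, pi0):
    """Bayes factor in favour of H0: posterior odds divided by prior odds."""
    return posterior_odds * (1.0 - pi0) / pi0

waits = [10, 3, 2, 5, 14]
prior_scale, prior_shape = 5.0, 3.0

# Uniform(0, theta) likelihood with a Pareto prior gives a Pareto posterior.
post_scale = max(prior_scale, max(waits))
post_shape = prior_shape + len(waits)

p0 = pareto_cdf(15.0, post_scale, post_shape)  # P(theta <= 15 | data)
p1 = 1.0 - p0
posterior_odds = p0 / p1

pi0_from_prior = pareto_cdf(15.0, prior_scale, prior_shape)  # implied by Pareto(5, 3)

print(round(p0, 3), round(posterior_odds, 3))
print(round(bayes_factor(posterior_odds, 0.5), 3))
print(round(bayes_factor(posterior_odds, pi0_from_prior), 4))
```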


7.4 JEFFREYS’ APPROACH 


It should be realized that the Bayes factor is only defined when neither $\pi_0$ nor $\pi_1$ is zero. This
means that if either $H_0$ or $H_1$ is an a-priori implausible hypothesis then no amount of data can modify this
information. In other words, if $\pi_0 = 0$ then the posterior probability $p_0 = 0$. This situation arises when
the parameter space is an interval and the prior distribution is chosen to be continuous, and we are
interested in testing the significance of a sharp null hypothesis $H_0: \theta = \theta_0$.

In fact, the wish to test $H_0: \theta \in \Theta_0$, irrespective of whether $\Theta_0$ is a singleton set $\{\theta_0\}$ or
an interval $(\theta_0 - \varepsilon, \theta_0 + \varepsilon)$, suggests that we have some belief in the truth of $H_0$. In other words,
the prior probability of $\theta \in \Theta_0$ is some positive number. Jeffreys (1961) suggested that we may assume
a mixture of a discrete one-point distribution at $\theta = \theta_0$ and a continuous distribution defined over the
remaining space $\Theta - \{\theta_0\}$. According to him, such a situation arises when the investigator has a reason
to believe, perhaps because of tradition or because of some support of physical theory, that the
parameter $\theta$ may have the specific value $\theta_0$. The investigator, therefore, must investigate whether the
parameter $\theta$ has some other value in the parameter space.

Thus, the prior distribution of $\theta$ should be such that $P(\theta = \theta_0) = \pi > 0$ and the remaining probability
$1 - \pi$ is spread over the remaining parameter space $\Theta_1 = \Theta - \{\theta_0\}$ in accordance with some prior pdf
$g_1(\theta)$. This means that for any subset $A \subset \Theta_1$, $P(\theta \in A) = (1 - \pi)\int_A g_1(\theta)\, d\theta$.


Let us define the prior distribution $g(\theta)$ over the parameter space $\Theta$ such that

$$g(\theta) = \pi g_0(\theta) + (1 - \pi) g_1(\theta), \tag{7.10}$$

where

$$g_0(\theta) = \begin{cases} 1 & \text{if } \theta = \theta_0 \\ 0 & \text{otherwise} \end{cases}$$

is the one-point distribution concentrated at $\theta = \theta_0$. In order to obtain the Bayes factor for comparing
$H_0: \theta = \theta_0$ against the alternative $H_1: \theta \ne \theta_0$, suppose we have an observed sample $x$ from the
population having pdf (or pmf) $f(x \mid \theta)$. The posterior probability for $H_0$ is

$$p_0 = \frac{\pi f(x \mid \theta_0)}{m(x)},$$

where

$$m(x) = \int_{\Theta} g(\theta) f(x \mid \theta)\, d\theta
= \pi \int_{\{\theta_0\}} g_0(\theta) f(x \mid \theta)\, d\theta + (1 - \pi) \int_{\Theta_1} g_1(\theta) f(x \mid \theta)\, d\theta
= \pi f(x \mid \theta_0) + (1 - \pi) m_1(x), \tag{7.11}$$

where

$$m_1(x) = \int_{\Theta_1} g_1(\theta) f(x \mid \theta)\, d\theta$$

is the marginal density of $x$ under $H_1$.
It may be noted that addition or deletion of one point does not affect the value of the integral.
Therefore, we may replace $\Theta_1$ by $\Theta$ in the definition of $m_1(x)$ without changing the value of the integral.
Hence, the posterior probability of $H_1$ is

$$p_1 = 1 - p_0 = 1 - \frac{\pi f(x \mid \theta_0)}{m(x)} = (1 - \pi)\frac{m_1(x)}{m(x)}.$$

Hence, the Bayes factor in favour of $H_0$ is

$$B_{01} = \frac{\pi f(x \mid \theta_0)}{(1 - \pi) m_1(x)} \Big/ \frac{\pi}{1 - \pi} = \frac{f(x \mid \theta_0)}{m_1(x)}. \tag{7.12}$$

One can easily see that $p_0$ can be evaluated in terms of the Bayes factor. We have

$$p_0 = \left[1 + \frac{1 - \pi}{\pi}\cdot\frac{1}{B_{01}}\right]^{-1}. \tag{7.13}$$


Example 7.9. Suppose $X \sim \text{Bin}(n, \theta)$ and $g_1(\theta) = 1$ for $\theta \ne 1/2$. We wish to test $H_0: \theta = 1/2$ against
$H_1: \theta \ne 1/2$. Since

$$f(x \mid \theta_0) = \binom{n}{x}\left(\frac{1}{2}\right)^n$$

and

$$m_1(x) = \int_0^1 g_1(\theta) f(x \mid \theta)\, d\theta = \binom{n}{x} B(x + 1, n - x + 1),$$

we have

$$B_{01} = \frac{(1/2)^n}{B(x + 1, n - x + 1)},$$

and

$$p_0 = \left[1 + \frac{1 - \pi}{\pi}\, 2^n B(x + 1, n - x + 1)\right]^{-1}.$$

In particular, if both the hypotheses are a-priori equally likely, that is, $\pi = 1/2$, then for $n = 5$ and
$x = 1$, $B_{01} = 15/16$ and $p_0 = 15/31$. However, if $x = 3$ is observed then the Bayes factor $B_{01} = 15/8$ and
$p_0 = 15/23$. Since the Bayes factor is near 2, we may say that the data support the null hypothesis.
However, when $x = 1$, the Bayes factor was near 1 and, therefore, the data were unable to discriminate
between $H_0$ and $H_1$.
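
The fractions quoted in Example 7.9 can be reproduced exactly. The sketch below uses rational arithmetic; `beta_int`, `bayes_factor_binomial` and `posterior_prob_h0` are illustrative names, and the last function is simply relation (7.13).

```python
from fractions import Fraction
from math import factorial

def beta_int(a, b):
    """Beta function B(a, b) for positive integers a and b, as an exact fraction."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

def bayes_factor_binomial(n, x):
    """B_01 for H0: theta = 1/2 against a uniform prior on theta under H1."""
    return Fraction(1, 2) ** n / beta_int(x + 1, n - x + 1)

def posterior_prob_h0(b01, pi):
    """Relation (7.13): p_0 = [1 + ((1 - pi) / pi) / B_01]^(-1)."""
    return 1 / (1 + (1 - pi) / pi / b01)

for x in (1, 3):
    b01 = bayes_factor_binomial(5, x)
    print(x, b01, posterior_prob_h0(b01, Fraction(1, 2)))
```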
Example 7.10. Consider the test $H_0: \theta = 2$ against $H_1: \theta \ne 2$ when the data are observed from the
Poisson distribution with unknown parameter $\theta$. If $g_1(\theta)$ is Gamma$(\alpha, \beta)$ and $x = 4$ is observed then the
Bayes factor in favour of $H_0$ is

$$B_{01} = \frac{f(x \mid \theta_0)}{m_1(x)} = \frac{e^{-2} 2^4}{4!} \Big/ \int_0^\infty \frac{e^{-\theta}\theta^4}{4!}\cdot\frac{\beta^\alpha e^{-\beta\theta}\theta^{\alpha - 1}}{\Gamma(\alpha)}\, d\theta
= \frac{16 e^{-2}\, \Gamma(\alpha)(\beta + 1)^{\alpha + 4}}{\Gamma(\alpha + 4)\beta^\alpha}.$$

In particular, if $\alpha = \beta = 1$, that is, the prior is $g_1(\theta) = e^{-\theta}$, then the Bayes factor $B_{01} = 2.89$. Hence
$p_0 = 0.74$ when $\pi = 1/2$. However, if the prior information is weak, we may represent it by an improper
prior distribution. For the sake of illustration, let us represent the improper nil-prior for $\theta$ as $g_1(\theta) \propto 1/\theta$,
which is obtained by letting $\alpha, \beta \to 0$. The Bayes factor then tends to infinity (for a detailed discussion
of such behaviour see O'Hagan (1994), page 192).
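
The closed form in Example 7.10 is easy to evaluate on the log scale. The sketch below is a direct transcription of that formula for general $x$ and $\theta_0$; the function name is illustrative. Shrinking $\alpha$ and $\beta$ together shows the Bayes factor growing, in line with the remark about the improper limit.

```python
import math

def bayes_factor_poisson(x, theta0, alpha, beta):
    """B_01 for a Poisson count x, H0: theta = theta0, Gamma(alpha, beta) prior under H1."""
    # log f(x | theta0) for the Poisson pmf
    log_f0 = -theta0 + x * math.log(theta0) - math.lgamma(x + 1)
    # log m_1(x): Poisson likelihood integrated against the Gamma(alpha, beta) density
    log_m1 = (alpha * math.log(beta) + math.lgamma(alpha + x)
              - math.lgamma(x + 1) - math.lgamma(alpha)
              - (alpha + x) * math.log(beta + 1))
    return math.exp(log_f0 - log_m1)

b01 = bayes_factor_poisson(4, 2.0, 1.0, 1.0)
p0 = b01 / (1.0 + b01)  # relation (7.13) with pi = 1/2
print(round(b01, 2), round(p0, 2))

for a in (0.1, 0.01, 0.001):
    print(a, round(bayes_factor_poisson(4, 2.0, a, a), 1))
```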

Example 7.11. (Leonard & Hsu, 1999) Suppose $X$ is the number of successes in $n$ iid Bernoulli trials
with probability of success $\theta$. We wish to investigate $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$, where $\theta_0$ is
specified in advance. Let us assume that the prior probability of $H_0$ is $\pi$ and that of $H_1$ is $(1 - \pi)$ such
that, under $H_1$, $\theta$ has a Beta$(\alpha, \beta)$ prior distribution. Using the discrete version of Bayes theorem,

$$p_0 = P(H_0 \mid x) = \frac{P(H_0) f(x \mid H_0)}{P(H_0) f(x \mid H_0) + P(H_1) f(x \mid H_1)} = \frac{\pi R}{\pi R + (1 - \pi)},$$

where

$$R = \frac{f(x \mid H_0)}{f(x \mid H_1)} = \frac{\theta_0^x (1 - \theta_0)^{n - x} B(\alpha, \beta)}{B(\alpha + x, \beta + n - x)}$$

and $f(x \mid H_1)$ is the marginal pmf of $X$ when $H_1$ is true. But

$$f(x \mid H_1) = \int_0^1 f(x \mid \theta) g_1(\theta)\, d\theta = \int_0^1 \binom{n}{x}\frac{\theta^{x + \alpha - 1}(1 - \theta)^{n - x + \beta - 1}}{B(\alpha, \beta)}\, d\theta
= \binom{n}{x}\frac{B(\alpha + x, \beta + n - x)}{B(\alpha, \beta)}, \quad x = 0, 1, \ldots, n,$$

and

$$p_1 = P(H_1 \mid x) = 1 - P(H_0 \mid x) = 1 - p_0.$$

We shall say that $H_0$ is a-posteriori more probable than $H_1$ if $P(H_0 \mid x) > P(H_1 \mid x)$, that is,

$$\frac{\pi R}{\pi R + 1 - \pi} > \frac{1}{2}$$

or

$$R > \frac{1 - \pi}{\pi} = \frac{P(H_1)}{P(H_0)}.$$


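In particular, consider $X \sim \text{Bin}(n, \theta)$, $H_0: \theta = 1/2$, $H_1: \theta \ne 1/2$. If the prior distribution $g_1(\theta)$ is
Beta$(\alpha, \alpha)$, then for $n = 10$ and $x = 5$,

$$R = \left(\frac{1}{2}\right)^{10}\frac{B(\alpha, \alpha)}{B(\alpha + 5, \alpha + 5)} = \left(\frac{1}{2}\right)^{10}\frac{(\Gamma(\alpha))^2\, \Gamma(2\alpha + 10)}{\Gamma(2\alpha)\,(\Gamma(\alpha + 5))^2}.$$

Using the duplication formula

$$\Gamma(2m) = \frac{2^{2m - 1}}{\sqrt{\pi}}\, \Gamma(m)\, \Gamma\!\left(m + \frac{1}{2}\right),$$

we have

$$R = \left(\frac{1}{2}\right)^{10} 2^{10}\, \frac{\Gamma(\alpha)\, \Gamma(\alpha + 11/2)}{\Gamma(\alpha + 5)\, \Gamma(\alpha + 1/2)}.$$

The limiting value of $R$ is $(1/2)^{10}\cdot 2^{10} = 1$ as $\alpha \to \infty$ (use Stirling's approximation), so that $p_0 \to \pi$.

This is intuitively logical because the Beta$(\alpha, \alpha)$
distribution has mean 1/2 and variance $1/\{4(2\alpha + 1)\}$. As $\alpha \to \infty$, the prior distribution of $\theta$ under $H_1$ tends to a
degenerate distribution centred at 1/2, since the prior variance tends to zero. Thus $H_1$ becomes indistinguishable from
the null hypothesis, the data cannot discriminate between the two, and the posterior probability of $H_0$
cannot be different from its prior probability.

If the prior for $\theta$ is the Bayes-Laplace uniform distribution, that is, $\alpha = 1$, then $R = 2.71$ and, for $\pi = 1/2$, $p_0 = 0.73$.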
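
As a numerical companion to Example 7.11, the sketch below evaluates the ratio $R = f(x \mid H_0)/f(x \mid H_1)$ on the log scale and converts it to $p_0 = \pi R/(\pi R + 1 - \pi)$. The function names are illustrative, and the loop over $\alpha$ is only meant to show how $R$ behaves as the Beta$(\alpha, \alpha)$ prior concentrates around 1/2.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def ratio_r(n, x, theta0, alpha, beta):
    """R = f(x | H0) / f(x | H1) with a Beta(alpha, beta) prior under H1."""
    log_r = (x * math.log(theta0) + (n - x) * math.log(1.0 - theta0)
             + log_beta(alpha, beta) - log_beta(alpha + x, beta + n - x))
    return math.exp(log_r)

def posterior_prob_h0(r, pi):
    """p_0 = pi R / (pi R + 1 - pi)."""
    return pi * r / (pi * r + 1.0 - pi)

for alpha in (1.0, 10.0, 100.0, 10000.0):
    r = ratio_r(10, 5, 0.5, alpha, alpha)
    print(alpha, round(r, 4), round(posterior_prob_h0(r, 0.5), 4))
```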


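Result 7.1. Suppose $T = t(x)$ is a sufficient statistic for the parameter $\theta$ so that
$f(x \mid \theta) = f(t \mid \theta) f(x \mid t)$, where $f(x \mid t)$ is independent of $\theta$. Then

$$m_1(x) = \int_{\Theta_1} g_1(\theta) f(x \mid \theta)\, d\theta
= \int_{\Theta_1} g_1(\theta) f(t \mid \theta) f(x \mid t)\, d\theta
= f(x \mid t)\int_{\Theta_1} g_1(\theta) f(t \mid \theta)\, d\theta
= f(x \mid t)\, m_1(t).$$

Therefore,

$$p_0 = \frac{\pi f(x \mid \theta_0)}{m(x)} = \frac{\pi f(x \mid t) f(t \mid \theta_0)}{m(x)}$$

and

$$p_1 = (1 - \pi)\frac{f(x \mid t)\, m_1(t)}{m(x)}.$$

Hence, the Bayes factor is

$$B_{01} = \frac{f(t \mid \theta_0)}{m_1(t)}. \tag{7.14}$$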


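Example 7.12. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from $N(\theta, \sigma^2)$ with $\sigma^2$ known. We wish to
compare $H_0: \theta = \theta_0$ with $H_1: \theta \ne \theta_0$. Let us assume that $\pi$ is the prior probability of the null
hypothesis and $g_1(\theta)$ is the prior distribution of $\theta$ under $H_1$. A convenient distribution under $H_1$ of $\theta$
is $N(\mu, \phi)$. Since the values of $\theta$ in the neighbourhood of $\theta_0$ are more likely than those far away, we
may take $\mu = \theta_0$. In order to ensure that $\theta$ may take values outside the interval $(\theta_0 - \varepsilon, \theta_0 + \varepsilon)$, we
may choose the variance $\phi$ such that $\sqrt{\phi}$ is much larger than $2\varepsilon$. Note that $\bar X - \theta \sim N(0, \sigma^2/n)$ and,
therefore, the distribution of $\bar X - \theta$ is independent of $\theta$. Since the distribution of $\theta$ is chosen to be
$N(\theta_0, \phi)$ and $\bar X - \theta$ and $\theta$ are independent, the marginal distribution of $\bar X\,(= \bar X - \theta + \theta)$ is
$N(\theta_0, (\sigma^2/n) + \phi)$. Therefore, the Bayes factor in favour of $H_0$ is

$$B_{01} = \frac{f(\bar x \mid \theta_0)}{m_1(\bar x)}
= \frac{(2\pi\sigma^2/n)^{-1/2}\exp\{-n(\bar x - \theta_0)^2/(2\sigma^2)\}}{\{2\pi(\sigma^2/n + \phi)\}^{-1/2}\exp\{-(\bar x - \theta_0)^2/(2(\sigma^2/n + \phi))\}}
= \left(1 + \frac{n\phi}{\sigma^2}\right)^{1/2}\exp\left[-\frac{z^2}{2}\left\{1 + \frac{\sigma^2}{n\phi}\right\}^{-1}\right], \tag{7.15}$$

where $z = \sqrt{n}\,|\bar x - \theta_0|/\sigma$.

Example 7.13. Suppose $\bar X$ has a normal distribution with mean $\theta$ and known variance $\sigma^2/n$. The
maximum likelihood estimate of $\theta$ is $\hat\theta = \bar x$. If $g_1(\theta)$ is the prior distribution of $\theta$ under $H_1: \theta \ne \theta_0$ then

$$m_1(\bar x) = \int_{\Theta_1} g_1(\theta) f(\bar x \mid \theta)\, d\theta, \quad \Theta_1 = \{\theta : \theta \ne \theta_0\}.$$

Since $\hat\theta$ is the mle of $\theta$, $f(\bar x \mid \hat\theta) \ge f(\bar x \mid \theta)$ for all $\theta \in \Theta_1$, and $\int_{\Theta_1} g_1(\theta)\, d\theta = 1$, we have

$$m_1(\bar x) \le f(\bar x \mid \bar x) = (2\pi\sigma^2/n)^{-1/2}.$$

For $z = \sqrt{n}\,|\bar x - \theta_0|/\sigma$,

$$B_{01} = \frac{f(\bar x \mid \theta_0)}{m_1(\bar x)} \ge \exp\left(-\frac{z^2}{2}\right).$$

Thus, $\exp(-z^2/2)$ is a lower bound for the Bayes factor irrespective of the choice of $g_1(\theta)$.
In particular, if $z = 2.576$, then the Bayes factor is at least 0.036 and, when $\pi = 1/2$, the posterior probability of
$H_0$ that $\theta = \theta_0$ is at least 0.035. It may be noted that this bound does not depend on the sample size.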
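
The Bayes factor (7.15) and the prior-free lower bound of Example 7.13 can be compared numerically. This is a sketch only: the function names are illustrative, and the values of $n$ and $\phi$ in the loop are arbitrary choices made to show that (7.15) never falls below the bound.

```python
import math

def bayes_factor_normal(z, n, phi, sigma2):
    """Equation (7.15): B_01 for H0: theta = theta0 with a N(theta0, phi) prior under H1."""
    shrink = 1.0 / (1.0 + sigma2 / (n * phi))
    return math.sqrt(1.0 + n * phi / sigma2) * math.exp(-0.5 * z * z * shrink)

def bayes_factor_lower_bound(z):
    """Lower bound exp(-z^2 / 2), valid for every prior g_1 under H1."""
    return math.exp(-0.5 * z * z)

def posterior_prob_h0(b01, pi=0.5):
    """Relation (7.13)."""
    return 1.0 / (1.0 + (1.0 - pi) / (pi * b01))

bound = bayes_factor_lower_bound(2.576)
print(round(bound, 3), round(posterior_prob_h0(bound), 3))

for n, phi in [(10, 0.5), (10, 4.0), (100, 4.0)]:  # arbitrary illustrative settings
    print(n, phi, round(bayes_factor_normal(2.576, n, phi, 1.0), 3))
```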


Example 7.14. Suppose that $X = (X_1, X_2, \ldots, X_n)$ is a random sample from the Bernoulli distribution with
unknown probability $\theta$ of success in each trial and we wish to test $H_0: \theta = 1/2$ against the alternative
hypothesis $H_1: \theta \ne 1/2$. Since we cannot assume any prior distribution for $\theta$ under $H_1$, let us find the
lower bounds for the Bayes factor and the posterior probability of $H_0$ when $P(H_0)$ is 1/2.

Recall that the sufficient statistic for the parameter $\theta$ of the Bernoulli distribution is
$t(x) = \sum_{i=1}^{n} x_i = t$ (say). We have

$$B_{01} = \frac{f(t \mid \theta_0)}{m_1(t)} \ge \frac{f(t \mid \theta_0)}{f(t \mid \hat\theta)} = \frac{\theta_0^{t}(1 - \theta_0)^{n - t}}{\hat\theta^{t}(1 - \hat\theta)^{n - t}},$$

since the maximum likelihood estimate of $\theta$ is $\hat\theta = t/n$. In particular, if we observe 15 successes in 20
Bernoulli trials, we have

$$B_{01} \ge \left(\frac{2}{3}\right)^{15} 2^{5} = 0.07,$$

and the lower bound for the posterior probability of $H_0$ is $(1 + B_{01}^{-1})^{-1} = 0.065$, where the prior probabilities
of $H_0$ and $H_1$ are the same.

However, if our null hypothesis is $H_0: \theta = 3/4$ against $H_1: \theta \ne 3/4$, then for the above data $\hat\theta = \theta_0$ and the bound becomes
$B_{01} \ge 1$. This is quite expected since the data support the null hypothesis.
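Example 7.15. (Lee, 1997) In Example 7.12, if $\sigma^2$ is assumed to be unknown, we may obtain the Bayes
factor as follows:
Let us consider that the prior for $\sigma^2$ is Jeffreys' non-informative prior $g(\sigma^2) \propto 1/\sigma^2$. Then

$$f(x \mid \theta_0) = \int_0^\infty f(x, \sigma^2 \mid \theta_0)\, d\sigma^2
= \int_0^\infty g(\sigma^2) f(x \mid \theta_0, \sigma^2)\, d\sigma^2
\propto \int_0^\infty (\sigma^2)^{-n/2 - 1}\exp\left[-\frac{1}{2\sigma^2}\{S + n(\bar x - \theta_0)^2\}\right] d\sigma^2,$$

where $S = \sum_{i=1}^{n}(x_i - \bar x)^2$. Writing $t = (\theta_0 - \bar x)\sqrt{n}/s$, where $s^2 = S/(n - 1)$, we have

$$f(x \mid \theta_0) \propto \left[1 + \frac{t^2}{n - 1}\right]^{-n/2}.$$

In order to obtain $m_1(x)$ under $H_1$, let us assume, as in Example 7.12, that $\theta \sim N(\theta_0, \phi)$. Thus,

$$f_1(x \mid \sigma^2) = \int_{-\infty}^{\infty} g_1(\theta) f(x \mid \theta, \sigma^2)\, d\theta
\propto \phi^{-1/2}(\sigma^2)^{-n/2}\int_{-\infty}^{\infty}\exp\left[-\frac{1}{2}\left\{\frac{(\theta - \theta_0)^2}{\phi} + \frac{S + n(\bar x - \theta)^2}{\sigma^2}\right\}\right] d\theta$$

$$\propto (\sigma^2)^{-n/2}\left(1 + \frac{n\phi}{\sigma^2}\right)^{-1/2}\exp\left[-\frac{1}{2\sigma^2}\left\{S + \frac{n(\bar x - \theta_0)^2}{1 + n\phi/\sigma^2}\right\}\right].$$

In particular, if we assume $\phi = k\sigma^2$, then

$$m_1(x) = \int_0^\infty g(\sigma^2) f_1(x \mid \sigma^2)\, d\sigma^2
\propto (1 + nk)^{-1/2}\int_0^\infty (\sigma^2)^{-n/2 - 1}\exp\left[-\frac{1}{2\sigma^2}\left\{S + \frac{n(\bar x - \theta_0)^2}{1 + nk}\right\}\right] d\sigma^2$$

$$\propto (1 + nk)^{-1/2}\left[1 + \frac{t^2}{(n - 1)(1 + nk)}\right]^{-n/2}.$$

Note that $f(x \mid \theta_0)$ and $m_1(x)$ both involve t-distributions with $(n - 1)$ df. Hence, the Bayes factor in
favour of $H_0$ is

$$B_{01} = \frac{f(x \mid \theta_0)}{m_1(x)}
= (1 + nk)^{1/2}\left[1 + \frac{t^2}{n - 1}\right]^{-n/2}\Big/\left[1 + \frac{t^2}{(n - 1)(1 + nk)}\right]^{-n/2}. \tag{7.16}$$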
Remark 7.10. For large $n$, the t-density function tends to the normal density function. Since the
numerator and denominator of $B_{01}$ are both kernels of t-density with $(n - 1)$ df, we have

$$B_{01} \approx (1 + nk)^{1/2}\exp\left(-\frac{t^2}{2}\right)\Big/\exp\left(-\frac{t^2}{2(1 + nk)}\right),$$

which on simplification gives

$$B_{01} \approx (1 + nk)^{1/2}\exp\left\{-\frac{t^2}{2}\cdot\frac{nk}{1 + nk}\right\}. \tag{7.17}$$
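
The quality of the large-sample approximation in Remark 7.10 can be inspected numerically. The sketch below codes (7.16) and (7.17) as reconstructed from the derivation in Example 7.15, with $\phi = k\sigma^2$; the function names and the chosen values of $t$, $n$ and $k$ are illustrative.

```python
import math

def bayes_factor_t(t, n, k):
    """Equation (7.16): Bayes factor for unknown variance, with phi = k * sigma^2."""
    nu = n - 1
    numerator = (1.0 + t * t / nu) ** (-n / 2.0)
    denominator = (1.0 + t * t / (nu * (1.0 + n * k))) ** (-n / 2.0)
    return math.sqrt(1.0 + n * k) * numerator / denominator

def bayes_factor_t_large_n(t, n, k):
    """Equation (7.17): normal approximation to (7.16) for large n."""
    return math.sqrt(1.0 + n * k) * math.exp(-0.5 * t * t * n * k / (1.0 + n * k))

for n in (5, 20, 100, 10000):
    exact = bayes_factor_t(2.0, n, 1.0)
    approx = bayes_factor_t_large_n(2.0, n, 1.0)
    print(n, round(exact, 4), round(approx, 4))
```

For small $n$ the two values differ noticeably because the t kernel has heavier tails than the normal kernel; the gap closes as $n$ grows.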


7.5 LINDLEY’S PROCEDURE FOR TEST OF SIGNIFICANCE 


Lindley's procedure for test of significance is appropriate when prior information is vague or
diffuse. Basically, the rationale for such a test is: accept $H_0: \theta = \theta_0$ when the suggested value $\theta_0$ for
$\theta$ lies in an interval in which the posterior density is high, and reject otherwise. It is based entirely
on what is reasonable a-posteriori. No decision theoretic justification is available for this procedure.
Lindley's procedure yields results that are computationally equivalent to large sample theory tests which
utilize the large sample normality of the mle, centred at the true parameter with approximate covariance
matrix given by the inverse of the estimated information matrix.

Some particular features of Lindley's procedure are:
(i) All the sample information is employed as it is available in the likelihood.
(ii) The null hypothesis $H_0: \theta = \theta_0$ is judged on the basis of posterior probability.

Lindley's procedure for testing $H_0: \theta = \theta_0$ against $H_1: \theta \ne \theta_0$ at $\alpha$-level of significance is as
follows:
Step 1: Derive the posterior pdf for the parameter using a diffuse prior. Suppose the posterior pdf of
$\theta$ is $f(\theta \mid x)$, which is unimodal.
Step 2: Construct the $(1 - \alpha)100\%$ HPD credible interval $(a, b)$ for $\theta$.
Step 3: If $\theta_0$ falls in the interval $(a, b)$, accept $H_0$ at $\alpha$-level of significance, otherwise reject $H_0$.
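
The three steps can be sketched for the simplest case, a normal posterior, where the HPD interval is symmetric about the posterior mean. The function names and the numbers used ($n = 25$, $\bar x = 10.4$, $\sigma_0 = 2$) are hypothetical and chosen only for illustration.

```python
import math

def hpd_normal(mean, sd, z_crit=1.96):
    """Step 2 for a normal posterior: the HPD interval is mean -/+ z_crit * sd."""
    return mean - z_crit * sd, mean + z_crit * sd

def lindley_test(theta0, interval):
    """Step 3: accept H0 when theta0 lies inside the HPD credible interval."""
    lower, upper = interval
    return "accept" if lower < theta0 < upper else "reject"

# Step 1 (hypothetical data): a diffuse prior with known sigma0 gives the
# posterior N(xbar, sigma0^2 / n).
n, xbar, sigma0 = 25, 10.4, 2.0
interval = hpd_normal(xbar, sigma0 / math.sqrt(n))

print(interval)
print(lindley_test(10.0, interval), lindley_test(9.5, interval))
```

For a skewed posterior the interval in Step 2 is no longer symmetric and has to be found numerically, but Step 3 is unchanged.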


Example 7.16. Suppose that our observations $X = (X_1, X_2, \ldots, X_n)$ are independently drawn from
$N(\theta, \sigma_0^2)$ with known $\sigma_0^2$. Under what condition should we reject the hypothesis $H_0: \theta = \theta_0$ at 5%
level of significance when the prior beliefs about $\theta$ are vague?

The posterior distribution of $\theta$, given $x$ and $\sigma = \sigma_0$, is $N(\bar x, \sigma_0^2/n)$. The 95% HPD credible interval
for $\theta$ is

$$\left(\bar x - 1.96\frac{\sigma_0}{\sqrt{n}},\ \bar x + 1.96\frac{\sigma_0}{\sqrt{n}}\right).$$

Lindley's procedure will accept $H_0$ at 0.05 significance level if $\theta_0$ lies in the interval $(\bar x \mp 1.96\sigma_0/\sqrt{n})$;
otherwise we reject. In particular, if we are interested in testing $H_0: \theta = 1$ at 5% level
of significance and suppose $n = 16$, $\bar x = 0.5$, $\sigma_0 = 1$, then the 95% HPD credible interval for $\theta$ is $(0.01, 0.99)$.
Since $\theta_0 = 1$ does not lie in this interval, we may reject $H_0$ at 5% level of significance.
Remark 7.11. Suppose one observes data $X \sim f(x \mid \theta)$ and is interested in testing $H_0: \theta = \theta_0$. Fisher
suggested the following procedure:
Step 1: Choose a test statistic $T = t(X)$, large values of $T$ reflecting the evidence against $H_0$.
Step 2: Compute $p = P(t(X) \ge t(x) \mid H_0)$, and reject $H_0$ if $p$ is small. According to Fisher, this p-value
may be viewed as an index of the 'strength of evidence' against $H_0$, with small $p$ indicating an unlikely
event and hence an unlikely hypothesis.

Example 7.17. Suppose $X_1, X_2, \ldots, X_m$ and $Y_1, Y_2, \ldots, Y_n$ are two independent samples from
$N(\theta_1, \sigma_1^2)$ and $N(\theta_2, \sigma_2^2)$, respectively.


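Case 1: Suppose $\sigma_1^2$ and $\sigma_2^2$ are known.

Let us assume that $\theta_1$ and $\theta_2$ are a-priori independent and have non-informative priors
$g(\theta_1) \propto 1$ and $g(\theta_2) \propto 1$. Suppose we wish to test the hypothesis $H_0: \theta_1 = \theta_2$ against the alternative
$H_1: \theta_1 \ne \theta_2$.

We shall use Lindley's approach to test the hypothesis at $\alpha$-level of significance. Since the
likelihood function of $\theta_1$ and $\theta_2$ is

$$\ell(\theta_1, \theta_2 \mid x, y) \propto \exp\left[-\frac{1}{2\sigma_1^2}\sum_{i=1}^{m}(x_i - \theta_1)^2 - \frac{1}{2\sigma_2^2}\sum_{j=1}^{n}(y_j - \theta_2)^2\right]
\propto \exp\left[-\frac{1}{2}\left\{\frac{m(\theta_1 - \bar x)^2}{\sigma_1^2} + \frac{n(\theta_2 - \bar y)^2}{\sigma_2^2}\right\}\right],$$

and the joint prior distribution of $\theta_1$ and $\theta_2$ is $g(\theta_1, \theta_2) \propto 1$, $-\infty < \theta_1, \theta_2 < \infty$, the posterior distribution
of $\theta_1$ and $\theta_2$ is

$$g(\theta_1, \theta_2 \mid x, y) \propto \ell(\theta_1, \theta_2 \mid x, y)\, g(\theta_1, \theta_2)
\propto \exp\left[-\frac{m(\theta_1 - \bar x)^2}{2\sigma_1^2}\right]\exp\left[-\frac{n(\theta_2 - \bar y)^2}{2\sigma_2^2}\right].$$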


Thus, the posterior distributions of $\theta_1$ and $\theta_2$ are also independent and are $N(\bar x, \sigma_1^2/m)$ and
$N(\bar y, \sigma_2^2/n)$ distributed, respectively. If we denote $\delta = \theta_1 - \theta_2$ then the hypotheses may be rewritten
as $H_0: \delta = 0$ and $H_1: \delta \ne 0$. The posterior distribution of $\delta$ is

$$N\left(\bar x - \bar y,\ \frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n}\right).$$

For $m = 12$, $n = 7$, $\bar x = 120$, $\bar y = 101$, $\sigma_1^2 = 457$, and $\sigma_2^2 = 425$, the posterior distribution of $\delta$ is $N(19, 99)$. The 95% HPD
credible interval for $\delta$ is $(19 \mp 1.96\sqrt{99})$, that is, $(-0.5, 38.5)$. Since $\delta = 0$ lies in this interval, we shall
not reject the null hypothesis at 5% level of significance that the two samples are drawn from the same
population. However, at $\alpha = 0.1$, the corresponding HPD credible interval for $\delta$ is $(19 \mp 1.6449\sqrt{99})$,
that is, $(3, 35)$ and, therefore, we shall reject $H_0$ at 0.1 level of significance.
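
The Case 1 numbers can be reproduced with a few lines. The sketch below uses the summary statistics quoted above and the normal critical values 1.96 and 1.6449 from the text; variable and function names are illustrative.

```python
import math

m, n = 12, 7
xbar, ybar = 120.0, 101.0
var1, var2 = 457.0, 425.0  # known variances sigma_1^2 and sigma_2^2

# Posterior of delta = theta_1 - theta_2 under flat priors.
post_mean = xbar - ybar
post_var = var1 / m + var2 / n

def hpd_interval(z_crit):
    """HPD interval of the normal posterior of delta."""
    half_width = z_crit * math.sqrt(post_var)
    return post_mean - half_width, post_mean + half_width

interval_95 = hpd_interval(1.96)
interval_90 = hpd_interval(1.6449)

for label, (lower, upper) in (("95%", interval_95), ("90%", interval_90)):
    decision = "do not reject" if lower < 0.0 < upper else "reject"
    print(label, round(lower, 1), round(upper, 1), decision)
```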


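Case 2: Assume that $\sigma_1^2 = \sigma_2^2 = \sigma^2$, where $\sigma^2$ is unknown.

Let $\theta_1$, $\theta_2$ and $\sigma^2$ be a-priori independent such that $g(\theta_1) \propto 1$, $g(\theta_2) \propto 1$ and $g(\sigma^2) \propto 1/\sigma^2$, so that
$g(\theta_1, \theta_2, \sigma^2) \propto 1/\sigma^2$.

Let us define $S_1 = \sum_{i=1}^{m}(x_i - \bar x)^2$, $S_2 = \sum_{j=1}^{n}(y_j - \bar y)^2$, $S = S_1 + S_2$, $\nu_1 = m - 1$, $\nu_2 = n - 1$, $\nu = \nu_1 + \nu_2$.
The joint posterior distribution of $\theta_1$, $\theta_2$ and $\sigma^2$ is

$$g(\theta_1, \theta_2, \sigma^2 \mid x, y) \propto \ell(\theta_1, \theta_2, \sigma^2 \mid x, y)\, g(\theta_1, \theta_2, \sigma^2)
\propto (\sigma^2)^{-(m + n)/2 - 1}\exp\left[-\frac{1}{2\sigma^2}\{S + m(\theta_1 - \bar x)^2 + n(\theta_2 - \bar y)^2\}\right]$$

$$= (\sigma^2)^{-\nu/2 - 1}\exp\left[-\frac{S}{2\sigma^2}\right](\sigma^2)^{-1/2}\exp\left[-\frac{m(\theta_1 - \bar x)^2}{2\sigma^2}\right](\sigma^2)^{-1/2}\exp\left[-\frac{n(\theta_2 - \bar y)^2}{2\sigma^2}\right]$$

$$\propto g(\sigma^2 \mid S)\, g(\theta_1 \mid \bar x, \sigma^2)\, g(\theta_2 \mid \bar y, \sigma^2),$$

where the conditional posterior densities of $\theta_1$ and $\theta_2$, given $\sigma^2$, are independent
$N(\bar x, \sigma^2/m)$ and $N(\bar y, \sigma^2/n)$, respectively. The marginal posterior density of $\sigma^2$ is
Inverted-Gamma$\left(\frac{\nu}{2}, \frac{S}{2}\right)$. Therefore, the joint posterior density of $\delta\,(= \theta_1 - \theta_2)$ and $\sigma^2$ is such that

$$g(\delta, \sigma^2 \mid x, y) \propto g(\sigma^2 \mid S)\, g(\delta \mid \bar x - \bar y, \sigma^2),$$

where $g(\delta \mid \bar x - \bar y, \sigma^2)$ is the $N(\bar x - \bar y, \sigma^2/m + \sigma^2/n)$ density. In order to obtain the marginal posterior
density of $\delta$, let us integrate out $\sigma^2$ from the above density. We have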
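$$g(\delta \mid x, y) = \int_0^\infty g(\delta, \sigma^2 \mid x, y)\, d\sigma^2
\propto \int_0^\infty (\sigma^2)^{-\nu/2 - 1}\exp\left(-\frac{S}{2\sigma^2}\right)(\sigma^2)^{-1/2}\exp\left[-\frac{\{\delta - (\bar x - \bar y)\}^2}{2\sigma^2(m^{-1} + n^{-1})}\right] d\sigma^2$$

$$= \int_0^\infty (\sigma^2)^{-\nu/2 - 3/2}\exp\left[-\frac{1}{2\sigma^2}\left\{S + \frac{\{\delta - (\bar x - \bar y)\}^2}{m^{-1} + n^{-1}}\right\}\right] d\sigma^2
\propto \left[1 + \frac{\{\delta - (\bar x - \bar y)\}^2}{S(m^{-1} + n^{-1})}\right]^{-(\nu + 1)/2}.$$

If we write

$$t = \frac{\delta - (\bar x - \bar y)}{s\sqrt{m^{-1} + n^{-1}}}, \quad \text{where } s^2 = \frac{S}{\nu},$$

then

$$g(\delta \mid x, y) \propto \left[1 + \frac{t^2}{\nu}\right]^{-(\nu + 1)/2},$$

which is a kernel of Student's t-distribution with $\nu = m + n - 2$ df.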


For the data of Case 1, suppose we had $\sum_{i=1}^{12}(x_i - \bar x)^2 = 5032$ and $\sum_{j=1}^{7}(y_j - \bar y)^2 = 2552$, then

$$s^2 = \frac{\sum_i (x_i - \bar x)^2 + \sum_j (y_j - \bar y)^2}{m + n - 2} = 446 \quad\text{and}\quad s\sqrt{m^{-1} + n^{-1}} = 10.$$

Therefore, the posterior distribution of $\delta$
is given by $(\delta - 19)/10 \sim t_{17}$. The 95% HPD credible interval for $\delta$ is $(19 \mp 2.11 \times 10)$, that is, $(-2.1, 40.1)$
and the 90% HPD credible interval for $\delta$ is $(19 \mp 1.74 \times 10)$, that is, $(1.6, 36.4)$. Thus, we may not reject
$H_0$ at 5% level of significance but reject $H_0$ at 10%.

Remark 7.12. The above example suggests that arbitrarily chosen level of significance may lead to 
different conclusions. However, the conclusions are similar to those of Case (1) and, therefore, it will 
not matter much whether we assume that the sample estimates of the variances are taken as known 
variances of the two populations or not, unless m and n are very small. 


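Example 7.18. Let us consider the hypothesis $H_0: \sigma_1^2 = \sigma_2^2$ against $H_1: \sigma_1^2 \ne \sigma_2^2$, in the framework of
the above Example 7.17. Suppose the joint prior distribution of $\theta_1, \theta_2, \sigma_1^2, \sigma_2^2$ is such that
$g(\theta_1, \theta_2, \sigma_1^2, \sigma_2^2) \propto (\sigma_1^2\sigma_2^2)^{-1}$, that is, $\theta_1, \theta_2, \sigma_1^2$ and $\sigma_2^2$ are a-priori independent and their priors are
non-informative. Thus the joint posterior distribution of $(\theta_1, \theta_2, \sigma_1^2, \sigma_2^2)$ is

$$g(\theta_1, \theta_2, \sigma_1^2, \sigma_2^2 \mid x, y) \propto \ell(\theta_1, \theta_2, \sigma_1^2, \sigma_2^2 \mid x, y)\, g(\theta_1, \theta_2, \sigma_1^2, \sigma_2^2)$$

$$\propto (\sigma_1^2)^{-m/2 - 1}(\sigma_2^2)^{-n/2 - 1}\exp\left[-\frac{1}{2}\left\{\frac{S_1 + m(\theta_1 - \bar x)^2}{\sigma_1^2} + \frac{S_2 + n(\theta_2 - \bar y)^2}{\sigma_2^2}\right\}\right].$$

On integrating out $\theta_1$ and $\theta_2$ we have

$$g(\sigma_1^2, \sigma_2^2 \mid x, y) \propto (\sigma_1^2)^{-\nu_1/2 - 1}(\sigma_2^2)^{-\nu_2/2 - 1}\exp\left[-\frac{1}{2}\left\{\frac{S_1}{\sigma_1^2} + \frac{S_2}{\sigma_2^2}\right\}\right].$$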

In order to compare the variances of two normal populations, we may consider a variety of
measures. For example, $\sigma_2^2/\sigma_1^2$, $\sigma_2/\sigma_1$, or $\log\sigma_2 - \log\sigma_1$.

Lindley's approach requires non-informative priors for the parameters under consideration. Box
and Tiao (1973) suggest use of standardized HPD credible intervals of $\log\sigma_2 - \log\sigma_1$ for construction
of HPD credible intervals since HPD intervals are equivalent under linear transformations of
$\log\sigma_2 - \log\sigma_1$.

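In order to obtain the posterior distribution of the ratio of variances, $\sigma_1^2/\sigma_2^2$, let us consider the
transformations $\phi_1 = \sigma_1^2/\sigma_2^2$ and $\phi_2 = \sigma_2^2$, so that

$$g(\phi_1, \phi_2 \mid x, y) = g(\sigma_1^2, \sigma_2^2 \mid x, y)\left|\frac{\partial(\sigma_1^2, \sigma_2^2)}{\partial(\phi_1, \phi_2)}\right|
= g(\sigma_1^2, \sigma_2^2 \mid x, y)\,\phi_2
\propto \phi_1^{-\nu_1/2 - 1}\phi_2^{-\nu/2 - 1}\exp\left[-\frac{1}{2\phi_2}\left\{\frac{S_1}{\phi_1} + S_2\right\}\right],$$

since

$$\frac{\partial(\phi_1, \phi_2)}{\partial(\sigma_1^2, \sigma_2^2)} = \begin{vmatrix} 1/\sigma_2^2 & -\sigma_1^2/(\sigma_2^2)^2 \\ 0 & 1 \end{vmatrix} = \frac{1}{\phi_2}.$$

Therefore,

$$g(\phi_1 \mid x, y) = \int_0^\infty g(\phi_1, \phi_2 \mid x, y)\, d\phi_2
\propto \phi_1^{-\nu_1/2 - 1}\int_0^\infty \phi_2^{-\nu/2 - 1}\exp\left[-\frac{1}{2\phi_2}\left\{\frac{S_1}{\phi_1} + S_2\right\}\right] d\phi_2
\propto \phi_1^{-\nu_1/2 - 1}\left[\frac{S_1}{\phi_1} + S_2\right]^{-\nu/2}.$$

In classical statistics, we consider the ratio $R = \dfrac{S_1/\nu_1}{S_2/\nu_2}$. Let $\phi = \phi_1/R$. The posterior distribution of the
transformed parameter $\phi$ is

$$g(\phi \mid x, y) \propto g(\phi_1 \mid x, y)\left|\frac{d\phi_1}{d\phi}\right|
\propto \phi^{-\nu_1/2 - 1}\left[\frac{S_1}{R\phi} + S_2\right]^{-\nu/2}
\propto \phi^{\nu_2/2 - 1}(\nu_1 + \phi\nu_2)^{-\nu/2}.$$

Thus, the posterior distribution of $\phi$ is the F-distribution with $\nu_2$ and $\nu_1$ df, and by symmetry, the posterior
distribution of $\phi^{-1}$ is F with $\nu_1$ and $\nu_2$ df.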


Remark 7.13. We notice that the posterior distribution of $\sigma_1^2/\sigma_2^2$, based on locally uniform priors for
$\theta_1$, $\theta_2$, $\log\sigma_1$, and $\log\sigma_2$, is numerically equivalent to that of the F-distribution and, therefore, HPD credible
intervals will also be numerically equivalent to classical confidence intervals.
Remark 7.14. The distribution of $F$ is unimodal but asymmetric. We may construct the $(1 - \alpha)100\%$
HPD credible interval $(\underline{F}, \overline{F})$ which satisfies the conditions

(i) $P(F < \underline{F}) + P(F > \overline{F}) = \alpha$, and

(ii) $\underline{F}^{\,\nu_1/2 - 1}\left(1 + \dfrac{\nu_1}{\nu_2}\underline{F}\right)^{-\nu/2} = \overline{F}^{\,\nu_1/2 - 1}\left(1 + \dfrac{\nu_1}{\nu_2}\overline{F}\right)^{-\nu/2}.$

A computer program may be developed to construct the required HPD credible interval.
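
One possible form of such a program is sketched below, using only the Python standard library. It assumes the first degree of freedom exceeds 2 so that the F density is unimodal with an interior mode, integrates the density with Simpson's rule, and finds by bisection the density height at which the two end points give coverage $1 - \alpha$. All function names are illustrative. Note that it returns the HPD interval of $F$ itself, which differs from the standardized interval based on $\log F$ tabulated by Box and Tiao (1973).

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom."""
    if x <= 0.0:
        return 0.0
    log_c = (math.lgamma((d1 + d2) / 2) - math.lgamma(d1 / 2) - math.lgamma(d2 / 2)
             + (d1 / 2) * math.log(d1 / d2))
    return math.exp(log_c + (d1 / 2 - 1) * math.log(x)
                    - ((d1 + d2) / 2) * math.log(1 + d1 * x / d2))

def simpson(func, a, b, steps=2000):
    """Composite Simpson rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = func(a) + func(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3

def bisect(func, lo, hi, iters=80):
    """Root of func on [lo, hi], assuming func(lo) and func(hi) differ in sign."""
    f_lo = func(lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2

def hpd_f(d1, d2, alpha):
    """HPD interval of F(d1, d2) for d1 > 2: equal end-point densities, coverage 1 - alpha."""
    def pdf(x):
        return f_pdf(x, d1, d2)

    mode = (d1 - 2) / d1 * d2 / (d2 + 2)

    def endpoints(height):
        lower = bisect(lambda x: pdf(x) - height, 0.0, mode)
        upper = mode + 1.0
        while pdf(upper) > height:  # expand until the density drops below height
            upper *= 2
        upper = bisect(lambda x: pdf(x) - height, mode, upper)
        return lower, upper

    def excess_coverage(height):
        lower, upper = endpoints(height)
        return simpson(pdf, lower, upper) - (1 - alpha)

    peak = pdf(mode)
    height = bisect(excess_coverage, 1e-9 * peak, (1 - 1e-9) * peak, iters=50)
    return endpoints(height)

lower, upper = hpd_f(11, 7, 0.10)
print(round(lower, 3), round(upper, 3))
```

Conditions (i) and (ii) of Remark 7.14 correspond to the coverage requirement inside `excess_coverage` and the common density height enforced by `endpoints`.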

Example 7.19. (Lee, 1997) Lord Rayleigh conducted an experiment in which masses $x$ (in grams) of 12
samples of nitrogen obtained from air and the masses $y$ of 8 samples obtained by a chemical method
within a given container at standard temperature were measured. The data were summarized to obtain
$\bar x = 2.31017$, $\bar y = 2.99247$, $s_1^2 = 18.75 \times 10^{-9}$, $s_2^2 = 1902 \times 10^{-9}$. We find that $R = 19/1902 = 0.01$.

Therefore, the posterior distribution of $\phi_1$ is such that $\phi = 100\phi_1 \sim F(7, 11)$ and $\phi^{-1} \sim F(11, 7)$. Using
Table-V of Box and Tiao (1973), the 90% HPD credible interval for $F(11, 7)$ is $(0.32, 3.46)$ and,
therefore, $\phi_1$ lies in the interval $\left(\dfrac{0.01}{3.46}, \dfrac{0.01}{0.32}\right)$, that is, $(0.003, 0.031)$. Since $H_0: \sigma_1^2 = \sigma_2^2$ is equivalent
to $H_0: \phi_1 = 1$ and since 1 does not lie in this interval, we may reject the null hypothesis of equal
variances at $\alpha = 0.1$ level of significance.
Remark 7.15. The posterior distributions of $\sigma_1^2/S_1$ and $\sigma_2^2/S_2$ are independent and distributed as
Inverted-$\chi^2(\nu_1)$ and Inverted-$\chi^2(\nu_2)$, respectively; therefore the posterior distribution of the ratio

$$F = \frac{\sigma_2^2/s_2^2}{\sigma_1^2/s_1^2} = \frac{\sigma_2^2/\sigma_1^2}{s_2^2/s_1^2} = \frac{s_1^2/s_2^2}{\sigma_1^2/\sigma_2^2}, \quad s_i^2 = S_i/\nu_i,\ i = 1, 2,$$

has the F distribution with $(\nu_1, \nu_2)$ df.
In the sampling theory, we have that $(s_1^2/s_2^2)/(\sigma_1^2/\sigma_2^2)$, in which $s_1^2/s_2^2$ is the random variable and
the ratio $\sigma_1^2/\sigma_2^2$ is an unknown fixed constant, has the F distribution with $(\nu_1, \nu_2)$ df. Thus, if the priors
for $(\theta_1, \theta_2, \log\sigma_1, \log\sigma_2)$ are locally uniform then the posterior distribution of $\sigma_2^2/\sigma_1^2$ is numerically
equivalent to the F-distribution.


Remark 7.16. In the example of constructing HPD intervals for $\sigma_1^2/\sigma_2^2$, the standardized HPD interval
will be the HPD interval for $\log\sigma_2 - \log\sigma_1$. Since HPD intervals are equivalent under linear transformations
of $\log\sigma_2 - \log\sigma_1$, $a + b(\log\sigma_2 - \log\sigma_1)$, where $a$ and $b$ are two arbitrary constants, we may consider

$$\log F = \log\sigma_2^2 - \log\sigma_1^2 - (\log s_2^2 - \log s_1^2) = 2(\log\sigma_2 - \log\sigma_1) + 2\log\frac{s_1}{s_2}.$$

Fisher (1924) derived the distribution of $\log F$ as

$$f(\log F) = \frac{(\nu_1/\nu_2)^{\nu_1/2}\, F^{\nu_1/2}}{B\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)\left(1 + \frac{\nu_1}{\nu_2}F\right)^{(\nu_1 + \nu_2)/2}}, \quad -\infty < \log F < \infty.$$

Limits of HPD intervals of $\log F$ are available in Table-V of Box and Tiao (1973) for a combination of
values of $\alpha$, $\nu_1$, $\nu_2$.

7.6 LINDLEY'S PARADOX


Lindley (1957) called attention to the paradoxical result that the non-Bayesian could strongly reject
a sharp null hypothesis $H_0: \theta = \theta_0$ while a Bayesian could assign a non-zero prior probability $\pi$ to $H_0$
and then spread the remaining prior probability $(1 - \pi)$ over all other values in a vague way and find
high posterior odds in favour of $H_0$.

Jeffreys (1939, pages 359-360) pointed out that Bayesian and Fisherian methods are incompatible
in principle, an observation that may be called Jeffreys' paradox. Jeffreys' paradox is closely related
to the fact that a user of tail area probabilities can cheat and can reach arbitrarily small tail area probabilities
if the investigator is allowed to use optional stopping, even when the null hypothesis is true. This form
of cheating is sometimes called sampling to a foregone conclusion. Good (1956) and Lindley (1957) both
pointed out that the Bayesian is better placed than the Fisherian (a user of tail area probabilities) because optional
stopping does not enable a Bayesian to cheat (the likelihood functions being the same).

For example, if one has observed 70 heads in 100 tosses of a coin, it cannot make any difference 
to the physical evidence about a coin whether the statistician is forced to stop experimenting due to 
limitation of time or money, that is, he fixed the number of tosses in advance or chose to stop because 
he wanted to toss the coin till 70 heads appeared. 

It could not be historically correct to call the more extended argument either Jeffreys’ paradox 
or Lindley’s paradox because Jeffreys didn’t say it at all and Lindley didn’t say it first. 

Lindley’s paradox is also sometimes known as 
(i) | Sampling to foregone conclusion 
(ii) | Jeffreys’ paradox or the Bayes / Fisher discrepancy, and 
(iii) Bayesian immunity to sampling to foregone conclusion. 

It may be noted that the Fisherian or Neyman - Pearsonian solution need not have a fixed sample 
size but must have a stopping rule that does not depend on the outcome of relevant observations. 

According to Shafer (1982), the essence of Lindley's paradox is precisely the inability of the
Bayesian theory to represent the strength of evidence. A relatively diffuse prior distribution $g_1(\theta)$ can
represent either very strong evidence (as when we know that if $H_1$ is true then $\theta$ was chosen at random
from the distribution $g_1(\theta)$) or very weak evidence (that is, nearly complete ignorance about $\theta$). In the
first case, the Bayesian calculation is unexceptionable, whereas in the second it is paradoxical.

Bartlett (1957) considered a similar problem in which the random variable $X$ has the
$N(\theta, \sigma^2)$ distribution, where $\sigma^2$ is assumed known. Suppose the prior probability that $\theta = 0$ is $\pi\,(\ne 0)$ and the
investigator is quite vague about how close to zero $\theta$ is when $\theta \ne 0$. Let us represent such vague
prior information when $\theta \ne 0$ by taking $g_1(\theta)$ as the $N(0, \phi)$ distribution where $\phi$ is very large. Then

$$P(\theta = 0 \mid x) = \frac{\pi f(x \mid \theta = 0)}{m(x)},$$

where $m(x) = \pi f(x \mid \theta = 0) + (1 - \pi) f(x \mid \theta \ne 0)$.

However,

$$f(x \mid \theta \ne 0) = \int_{\{\theta:\, \theta \ne 0\}} g_1(\theta) f(x \mid \theta)\, d\theta = \left[2\pi(\sigma^2 + \phi)\right]^{-1/2}\exp\left[-\frac{x^2}{2(\sigma^2 + \phi)}\right].$$

Hence $f(x \mid \theta \ne 0) \to 0$ and, therefore, $m(x) \to \pi f(x \mid \theta = 0)$ as $\phi \to \infty$. Thus $P(\theta = 0 \mid x) \to 1$ as
$\phi \to \infty$ and this happens for all values of the data $x$ and prior probability $\pi$.

O’ Hagan (1994) notes that similar problems arise whenever the prior distribution becomes 
improper over just one part of the parameter space or if different improper distributions are defined on 
different parts of the parameter space. 

Note that the prior distribution of $\theta$ is effectively zero where the improper distribution lies and,
therefore, we cannot take $g(\theta) \propto 1$ because it is not zero elsewhere ($\pi$ being not equal to zero when
$\theta = 0$). In case we are defining different improper distributions on different parts of the parameter space
then $g(\theta) \to 0$ at different rates for different parameter subspaces. Once again we are not justified to
write $g(\theta) \propto 1$.

The difficulty arises only when $\phi = \infty$. However, if $\phi \ne \infty$,

$$P(\theta = 0 \mid x) = \left[1 + \pi^{-1}(1 - \pi)(1 + \phi/\sigma^2)^{-1/2}\exp\left\{\frac{\phi x^2}{2\sigma^2(\sigma^2 + \phi)}\right\}\right]^{-1}$$

is a decreasing function of $|x|$ for fixed $\pi$, $\sigma^2$, and $\phi$. This is what we should expect.

Another way to look at the paradox is to consider a case of large samples. Suppose we have a
random sample $x = (x_1, x_2, \ldots, x_n)$ of size $n$ from $N(\theta, \sigma^2)$. Then, as before,

$$P(\theta = 0 \mid x) = \left[1 + \pi^{-1}(1 - \pi)(1 + n\phi/\sigma^2)^{-1/2}\exp\left\{\frac{n\phi\bar x^2}{2\sigma^2(n^{-1}\sigma^2 + \phi)}\right\}\right]^{-1}.$$

It is easy to see that it tends to zero as the sample size tends to infinity for all $\pi \in (0, 1)$,
$\sigma^2, \phi > 0$, and $\bar x \ne 0$. However, if $\bar x = 0$, $P(\theta = 0 \mid x)$ tends to unity as $n \to \infty$.
Thus, for large samples, one can correctly identify whether $\theta = 0$ with probability one. This is
perfectly reasonable. However, if we define $z = |\bar x|/\sqrt{\sigma^2/n}$, then for fixed $z$, $P(\theta = 0 \mid x)$ tends to
unity as $n \to \infty$ and it is true for any $z$.
The classical statistician would reject $H_0: \theta = 0$ when $z$ is large, whereas the Bayesian will not,
since he is convinced that this hypothesis is true (with probability one). Thus, we face an apparent
paradox. According to O'Hagan and Forster (2004), it is the frequentist analysis that is paradoxical: when we
fix $z$ and let $n \to \infty$, we must have $\bar x$ near zero. It is more unlikely to have such an $\bar x$ if $\theta \ne 0$ than
when $\theta = 0$.
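
The large-sample form of the paradox is easy to see numerically. The sketch below evaluates $P(\theta = 0 \mid x)$ for a fixed standardized value $z = 1.96$ (significant at the 5% level for every $n$) as the sample size grows. The formula is the one given above for a $N(0, \phi)$ prior under $H_1$; the settings $\pi = 1/2$, $\phi = 1$ and $\sigma^2 = 1$ are illustrative assumptions.

```python
import math

def posterior_prob_null(z, n, pi, phi, sigma2):
    """P(theta = 0 | x) for N(theta, sigma2) data with a N(0, phi) prior under H1.

    z is the fixed standardized statistic |xbar| / sqrt(sigma2 / n).
    """
    xbar_sq = z * z * sigma2 / n
    exponent = n * phi * xbar_sq / (2.0 * sigma2 * (sigma2 / n + phi))
    ratio = (1.0 + n * phi / sigma2) ** -0.5 * math.exp(exponent)
    return 1.0 / (1.0 + (1.0 - pi) / pi * ratio)

sample_sizes = (1, 10, 100, 10_000, 1_000_000)
probs = [posterior_prob_null(1.96, n, 0.5, 1.0, 1.0) for n in sample_sizes]
for n, p in zip(sample_sizes, probs):
    print(n, round(p, 3))
```

The classical test rejects $H_0$ at the 5% level in every row, yet the posterior probability of $H_0$ climbs towards one.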

Example 7.20. Suppose that $X = (X_1, X_2, \ldots, X_n)$ are $n$ independent observations drawn from a normal
population with unknown mean $\theta$ and known variance $\sigma^2 = 1$. Assume that our null hypothesis is
$H_0: \theta = 0$ and for some reason we consider that $H_0$ is true with probability $\pi_0\,(> 0)$. Let us further assume
that the prior probability $(1 - \pi_0)$ is spread out uniformly over the interval $(-a/2, a/2)$, that is,
$g_1(\theta) = 1/a$, $\theta \in (-a/2, a/2)$. The prior odds in favour of $H_0$ is $\pi_0/(1 - \pi_0)$ and the posterior odds in
favour of $H_0$ is

$$O(H_0 \mid x) = \frac{\pi_0}{1 - \pi_0}\cdot\frac{f(t \mid \theta_0)}{\int_{-a/2}^{a/2} g_1(\theta) f(t \mid \theta)\, d\theta}
= \frac{\pi_0}{1 - \pi_0}\cdot\frac{(2\pi)^{-n/2}\exp\left\{-\frac{1}{2}\left(\sum (x_i - \bar x)^2 + n\bar x^2\right)\right\}}{\frac{1}{a}\int_{-a/2}^{a/2}(2\pi)^{-n/2}\exp\left\{-\frac{1}{2}\left(\sum (x_i - \bar x)^2 + n(\theta - \bar x)^2\right)\right\} d\theta}$$

$$= \frac{\pi_0}{1 - \pi_0}\cdot\frac{\exp(-n\bar x^2/2)}{\frac{1}{a}\int_{-a/2}^{a/2}\exp\left\{-\frac{n}{2}(\theta - \bar x)^2\right\} d\theta},$$

where $t = \bar x$ is the sufficient statistic for $\theta$.

If $\bar x$ belongs to the interval $(-a/2, a/2)$ such that

$$\int_{-a/2}^{a/2}\left(\frac{n}{2\pi}\right)^{1/2}\exp\left\{-\frac{n}{2}(\theta - \bar x)^2\right\} d\theta \approx 1,$$

that is, the effective range of $\theta$ is $(-a/2, a/2)$, we have

$$O(H_0 \mid x) = \frac{\pi_0}{1 - \pi_0}\, a\left(\frac{n}{2\pi}\right)^{1/2}\exp(-n\bar x^2/2).$$

Denoting $Z_n = \sqrt{n}\,\bar X$, the classical statistician would reject the null hypothesis $H_0: \theta = 0$ at $\alpha = 0.05$
if $Z_n \ge 1.96$.
However, the posterior odds, when $Z_n = 1.96$, depend on the quantities $\pi_0$, $a$, and $n$. For certain
values of these quantities $O(H_0 \mid x)$ can be large even though $Z_n = 1.96$, resulting in a large posterior
probability of $H_0$. In particular, if we assume $\pi_0 = 1/2$ and $a = 1$,

$$O(H_0 \mid x) = \sqrt{\frac{n}{2\pi}}\exp\left(-\frac{(1.96)^2}{2}\right).$$

Thus, for $n = 1, 10, 100$, $O(H_0 \mid x)$ is 0.058, 0.185, and 0.584, respectively. The corresponding posterior
probabilities of $H_0$ are 0.055, 0.156 and 0.369, for the same $Z_n = 1.96$. This suggests that in general the
$\alpha$-significance level in a sampling theory test cannot be equated with the degree of belief in a
hypothesis represented by a posterior probability.
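
The figures in Example 7.20 follow directly from the approximate posterior odds formula $O(H_0 \mid x) = \{\pi_0/(1 - \pi_0)\}\, a\, (n/2\pi)^{1/2}\exp(-Z_n^2/2)$. A short sketch, with illustrative function names:

```python
import math

def posterior_odds(n, z=1.96, pi0=0.5, a=1.0):
    """Approximate posterior odds of H0: theta = 0 with a uniform(-a/2, a/2) prior under H1."""
    return pi0 / (1.0 - pi0) * a * math.sqrt(n / (2.0 * math.pi)) * math.exp(-0.5 * z * z)

def odds_to_probability(odds):
    """Convert odds in favour of H0 into the posterior probability of H0."""
    return odds / (1.0 + odds)

for n in (1, 10, 100):
    odds = posterior_odds(n)
    print(n, round(odds, 3), round(odds_to_probability(odds), 3))
```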


7.7 p-VALUE AND BAYESIAN SIGNIFICANCE PROBABILITY 
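Example 7.21. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from $f(x \mid \theta) = \theta e^{-\theta x}$, $0 < \theta < \infty$, and
it is required to investigate $H_0: \theta = \theta_0$ when $g(\theta)$ is Jeffreys' non-informative prior for $\theta$. The posterior
distribution of $\theta$, given $x$, is

$$g(\theta \mid x) = \frac{(n\bar x)^n}{\Gamma(n)}\,\theta^{n - 1} e^{-n\bar x\theta}.$$

Therefore, the Bayesian significance probability is

$$P(\theta \le \theta_0 \mid x) = \int_0^{\theta_0}\frac{(n\bar x)^n}{\Gamma(n)}\,\theta^{n - 1} e^{-n\bar x\theta}\, d\theta
= \int_0^{n\theta_0\bar x}\frac{(n\bar x)^n}{\Gamma(n)}\left(\frac{z}{n\bar x}\right)^{n - 1} e^{-z}\,\frac{dz}{n\bar x} \quad (\text{for } z = n\bar x\theta)$$

$$= \frac{1}{\Gamma(n)}\int_0^{n\theta_0\bar x} e^{-z} z^{n - 1}\, dz = I_{n\theta_0\bar x}(n),$$

which is an incomplete gamma function. However, the classical significance probability is

$$P(\bar X \le \bar x \mid \theta = \theta_0) = P(n\bar X \le n\bar x \mid \theta = \theta_0) = \int_0^{n\bar x}\frac{\theta_0^n}{\Gamma(n)}\, e^{-\theta_0 u} u^{n - 1}\, du,$$

since the sampling distribution of $n\bar X$, given $\theta$, is Gamma$(n, \theta)$. Substituting $\theta_0 u = z$, we have

$$P(\bar X \le \bar x \mid \theta = \theta_0) = \int_0^{\theta_0 n\bar x}\frac{\theta_0^n}{\Gamma(n)}\, e^{-z}\left(\frac{z}{\theta_0}\right)^{n - 1}\frac{dz}{\theta_0} = \frac{1}{\Gamma(n)}\int_0^{\theta_0 n\bar x} e^{-z} z^{n - 1}\, dz = I_{\theta_0 n\bar x}(n).$$

Note that Bayesian and classical significance probabilities are numerically equal when the underlying
prior is non-informative.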
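
The equality of the two significance probabilities in Example 7.21 can be checked numerically. The sketch below evaluates the Bayesian probability through the closed form of the regularized incomplete gamma function for integer $n$, and the classical probability by direct numerical integration of the Gamma$(n, \theta_0)$ density of $n\bar X$. The function names and the values $n = 5$, $\bar x = 0.8$, $\theta_0 = 1.5$ are hypothetical.

```python
import math

def regularized_lower_gamma(n, y):
    """I_y(n) = (1 / Gamma(n)) * integral_0^y z^(n-1) e^(-z) dz, for integer n >= 1."""
    return 1.0 - math.exp(-y) * sum(y ** k / math.factorial(k) for k in range(n))

def bayesian_significance(n, xbar, theta0):
    """P(theta <= theta0 | x) under the posterior Gamma(n, n * xbar)."""
    return regularized_lower_gamma(n, n * xbar * theta0)

def classical_significance(n, xbar, theta0, steps=20000):
    """P(n * Xbar <= n * xbar | theta0) by midpoint integration of the Gamma(n, theta0) density."""
    upper = n * xbar
    h = upper / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        log_density = (n * math.log(theta0) + (n - 1) * math.log(u)
                       - theta0 * u - math.lgamma(n))
        total += math.exp(log_density)
    return total * h

bayes = bayesian_significance(5, 0.8, 1.5)
classical = classical_significance(5, 0.8, 1.5)
print(round(bayes, 5), round(classical, 5))
```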
Definition 7.3. (p-value) The p-value associated with the test is the smallest significance level $\alpha$ for
which the null hypothesis is rejected.

In other words, the p-value against $H_0$ is defined as the probability, when $\theta = \theta_0$, of observing
an $X$ at least as extreme as the actual data $x$, that is, $X \ge x$.

The concept of p-value was introduced by Fisher (1956).


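Example 7.22. Suppose $X \sim N(\theta, \sigma^2)$ where $\sigma^2$ is known and the prior distribution of $\theta$ is non-informative, that is, $g(\theta) \propto 1$. Suppose we are interested in testing $H_0: \theta \le \theta_0$ against $H_1: \theta > \theta_0$; then the posterior probability of $H_0$ is

\[
p_0 = \int_{-\infty}^{\theta_0} (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{(\theta - x)^2}{2\sigma^2}\right\} d\theta = \Phi\left(\frac{\theta_0 - x}{\sigma}\right),
\]

since the posterior distribution of $\theta$, given $x$, is $N(x, \sigma^2)$. It is easy to see that the p-value (exact
significance level) against $H_0$ is

\[
P(X \ge x \mid \theta = \theta_0) = 1 - \Phi((x - \theta_0)/\sigma) = \Phi((\theta_0 - x)/\sigma) = p_0.
\]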
Thus the generalised Bayes answer is the p-value. 
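
The equality in Example 7.22 can be checked numerically. The following is a minimal sketch, not part of the original text: the function names and the illustrative values of $x$, $\theta_0$ and $\sigma$ are assumptions chosen only for demonstration.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Standard normal c.d.f. Phi(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def posterior_prob_h0(x: float, theta0: float, sigma: float) -> float:
    """P(theta <= theta0 | x) when theta | x ~ N(x, sigma^2), i.e. flat prior."""
    return std_normal_cdf((theta0 - x) / sigma)


def p_value(x: float, theta0: float, sigma: float) -> float:
    """P(X >= x | theta = theta0) for a single X ~ N(theta0, sigma^2)."""
    return 1.0 - std_normal_cdf((x - theta0) / sigma)


# Hypothetical illustrative numbers (not taken from the text).
x_obs, theta_0, sigma = 1.5, 0.0, 1.0
p0 = posterior_prob_h0(x_obs, theta_0, sigma)
pv = p_value(x_obs, theta_0, sigma)
print(f"posterior P(H0 | x) = {p0:.6f}, p-value = {pv:.6f}")
```

Under the flat prior the two quantities coincide for any choice of the inputs; with an informative prior they would generally differ.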

It may be noted that if the prior distribution of $\theta$ is not a non-informative distribution then the
posterior probability of $H_0$ may not be the p-value.
Remark 7.17. The use of a non-informative prior appears to be disturbing since it gives infinite mass
to each of the hypotheses (the prior probability of $H_0$ being true is $\int_{-\infty}^{\theta_0} g(\theta)\,d\theta = \infty$ and also
$P(H_1) = \infty$), leaving the prior odds indeterminate. Therefore, if we compute the Bayes factor as the
posterior odds we may imply prior odds of 1, which may be contradictory.

Remark 7.18. If we define significance level $\alpha$ to say that a result is significant if, and only if, the
p-value $\le \alpha$, then $p_0 \le \alpha$ and $p_1 = P(\theta > \theta_0 \mid X = x) \ge 1 - \alpha$.

Remark 7.19. A philosophical criticism of p-values is that they depend not only on the observed data
but also on the total sampling probability of certain unobserved data points, namely those for which
$X \ge x$. This may result in different p-values for two experiments designed differently but having
identical likelihoods. For example, binomial and negative binomial situations lead to identical likelihoods
though the experiments are different.

Example 7.23. Suppose in 12 independent tosses of a coin, 9 heads and 3 tails are observed. We wish
to test the null hypothesis $H_0: \theta = 1/2$ against $H_1: \theta > 1/2$, where $\theta$ is the true probability of heads.
We are not told in the statement of the problem how the experiment was conducted. Therefore,
two possibilities arise, namely,

(i) The number $n = 12$ of tosses was fixed beforehand and the random quantity $X$ was the number
of heads observed. Then the likelihood function of $\theta$ is

\[
L(\theta \mid x = 9) = \binom{12}{9} \theta^{9} (1 - \theta)^{3},
\]

since $X \sim \text{Bin}(12, \theta)$.

(ii) The experiment consisted of tossing the coin until the third tail appeared. In this case, the
random quantity $X$ would be the number of heads required to complete the experiment, so that
$X \sim \text{NBin}(3, \theta)$ and the likelihood function of $\theta$ is

\[
L(\theta \mid x = 9) = \binom{11}{9} \theta^{9} (1 - \theta)^{3}.
\]

Let us compute the p-value corresponding to the rejection region $\{x : X \ge c\}$; then for the first
experiment

\[
P\left(X \ge 9 \,\middle|\, \theta = \tfrac{1}{2}\right) = \sum_{j=9}^{12} \binom{12}{j} \left(\tfrac{1}{2}\right)^{12} = \frac{299}{4096} \approx 0.073.
\]

On the other hand, for the second experiment

\[
P\left(X \ge 9 \,\middle|\, \theta = \tfrac{1}{2}\right) = \sum_{j=9}^{\infty} \binom{j+2}{j} \left(\tfrac{1}{2}\right)^{j+3} = \frac{67}{2048} \approx 0.0327.
\]

If we are using significance level $\alpha = 0.05$, we would reject $H_0$ if the negative binomial model was assumed
but not if it was the binomial model. This suggests that using the p-value to test the null hypothesis
violates the likelihood principle.
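
The two tail probabilities of Example 7.23 can be reproduced with exact rational arithmetic. This is a small sketch rather than anything given in the text; the function names are illustrative, and the negative binomial tail is computed as one minus the finite sum over $j < 9$.

```python
from fractions import Fraction
from math import comb


def binomial_p_value(n: int, x: int, theta: Fraction) -> Fraction:
    """P(X >= x) when X ~ Bin(n, theta): heads in a fixed number n of tosses."""
    return sum(comb(n, j) * theta**j * (1 - theta) ** (n - j) for j in range(x, n + 1))


def negative_binomial_p_value(r: int, x: int, theta: Fraction) -> Fraction:
    """P(X >= x) when X is the number of heads observed before the r-th tail."""
    below = sum(comb(j + r - 1, j) * theta**j * (1 - theta) ** r for j in range(x))
    return 1 - below


half = Fraction(1, 2)
p_bin = binomial_p_value(12, 9, half)
p_nb = negative_binomial_p_value(3, 9, half)
print(f"binomial p-value          = {p_bin} = {float(p_bin):.4f}")
print(f"negative binomial p-value = {p_nb} = {float(p_nb):.4f}")
```

The same data (9 heads, 3 tails) fall on opposite sides of the 0.05 level depending on the stopping rule, which is the point of the example.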

Example 7.24. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from $N(\theta, 1)$ and the prior distribution of $\theta$
is $N(0, 1)$. If a sample of size $n = 10$ is observed yielding $\bar{x} = 1$, the posterior distribution of $\theta$ is
$N(10/11, 1/11)$. For testing $H_0: \theta \le 0$ against $H_1: \theta > 0$, the p-value is

\[
P(\bar{X} \ge \bar{x} \mid \theta = 0) = \Phi(-\sqrt{10}),
\]

since $\bar{X} \sim N(0, 1/10)$ when $\theta = 0$ and the normal distribution with unknown mean $\theta$ is a monotone likelihood ratio distribution, whereas the
posterior probability of $H_0$ being true is




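\[
p_0 = \int_{-\infty}^{0} \sqrt{\frac{11}{2\pi}}\, \exp\left\{-\frac{11}{2}\left(\theta - \frac{10}{11}\right)^{2}\right\} d\theta = \Phi\left(-\frac{10}{\sqrt{11}}\right).
\]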


Remark 7.20. See Robert (2001) for a general result.

Remark 7.21. Neyman criticised p-values for violating the frequentist principle while Jeffreys (1961)
felt that the logic of basing p-values on a tail area (as opposed to the actual data) was silly.

Remark 7.22. A common misinterpretation of p-values as error probabilities very often results in
considerable overstatements of the evidence against $H_0$.

Remark 7.23. In a number of one-sided testing situations, vague prior information will tend to result
in posterior probabilities that are similar to p-values. This is not true for all one-sided testing problems.
For instance, in testing $H_0: \theta = 0$ versus $H_1: \theta > 0$, p-values and posterior probabilities will tend to differ
drastically.

Example 7.25. (Berger, 2003) Suppose the data $X_1, X_2, \ldots, X_n$ are iid from the $N(\theta, \sigma^2)$ distribution with
$\sigma^2$ known and $n = 10$. It is desired to test $H_0: \theta = 0$ versus $H_1: \theta \ne 0$. If $z = \sqrt{n}\,\bar{x}/\sigma = 2.3$ then the
p-value is 0.021.

However, using Jeffreys' approach with equal prior probabilities of 1/2 each for $H_0$ and $H_1$ and
using a $N(0, \sigma^2)$ prior on $H_1$, we have Bayes factor equal to 0.2995 and, therefore, the posterior probability
of $H_0$ being true works out to be 0.2305. We notice the discrepancy between the numbers reported by
the p-value and Jeffreys' posterior probability of $H_0$.
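
The figures quoted in Example 7.25 can be reproduced as follows. The sketch assumes the closed form $B_{01} = \sqrt{n+1}\,\exp\{-z^2 n / (2(n+1))\}$, which follows from comparing the $N(0, \sigma^2/n)$ density of $\bar{x}$ under $H_0$ with its $N(0, \sigma^2(n+1)/n)$ marginal density under $H_1$; this formula and the function names are not stated in the text.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic z."""
    return 2.0 * (1.0 - std_normal_cdf(abs(z)))


def bayes_factor_01(z: float, n: int) -> float:
    """Bayes factor of H0: theta = 0 against H1 with a N(0, sigma^2) prior."""
    return math.sqrt(n + 1) * math.exp(-0.5 * z * z * n / (n + 1))


def posterior_prob_h0(bf01: float, prior_h0: float = 0.5) -> float:
    """Posterior probability of H0 from the Bayes factor and prior probability."""
    prior_odds = prior_h0 / (1.0 - prior_h0)
    post_odds = prior_odds * bf01
    return post_odds / (1.0 + post_odds)


z_obs, n_obs = 2.3, 10
p_val = two_sided_p_value(z_obs)
bf = bayes_factor_01(z_obs, n_obs)
post = posterior_prob_h0(bf)
print(f"p-value = {p_val:.3f}, Bayes factor = {bf:.4f}, P(H0 | x) = {post:.4f}")
```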


7.8 DECISION THEORETIC APPROACH TO TESTING PROBLEMS 


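Suppose we want to test $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_1$, where $\Theta_1$ is the complement of the set
$\Theta_0$. Let the action $a_i$ be 'accept hypothesis $H_i$', $i = 0, 1$. The action space $\mathcal{A}$ consists of only two points
$a_0$ and $a_1$. Let us consider the '0-1' loss function

\[
L(\theta, a_i) = \begin{cases} 0 & \text{if } \theta \in \Theta_i \\ 1 & \text{if } \theta \in \Theta_j; \quad i, j = 0, 1;\ i \ne j. \end{cases}
\]

If the posterior distribution of $\theta$, given $x$, is $g(\theta \mid x)$ then the posterior expected loss of action $a_0$ is

\[
\begin{aligned}
\rho_0 &= \int_{\Theta} L(\theta, a_0)\, g(\theta \mid x)\, d\theta \\
&= \int_{\Theta_0} L(\theta, a_0)\, g(\theta \mid x)\, d\theta + \int_{\Theta_1} L(\theta, a_0)\, g(\theta \mid x)\, d\theta \\
&= \int_{\Theta_1} L(\theta, a_0)\, g(\theta \mid x)\, d\theta = P(\theta \in \Theta_1 \mid x),
\end{aligned}
\]

and the posterior expected loss of taking action $a_1$ (that is, rejecting $H_0$) is

\[
\rho_1 = P(\theta \in \Theta_0 \mid x).
\]

According to the Bayes Principle, we should reject $H_0$ if $\rho_0 > \rho_1$. Hence, the critical region for the Bayes
test is

\[
C = \{x : P(\theta \in \Theta_1 \mid x) > P(\theta \in \Theta_0 \mid x)\} = \{x : P(\Theta_1 \mid x) > 1/2\}.
\]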
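Example 7.26. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from $N(\theta, r)$ with precision $r$ known and the
prior distribution of $\theta$ is $N(\mu, \tau)$. We wish to test the hypothesis $H_0: \theta \le \theta_0$ against $H_1: \theta > \theta_0$. Since
the posterior distribution of $\theta$ is $N(\mu_1, \tau_1)$, we should reject $H_0$ if

\[
\int_{\theta_0}^{\infty} \sqrt{\frac{\tau_1}{2\pi}}\, \exp\left\{-\frac{\tau_1}{2}(\theta - \mu_1)^2\right\} d\theta > \frac{1}{2},
\]

where $\tau_1 = \tau + nr$ and $\mu_1 = (\tau\mu + nr\bar{x})/\tau_1$.

Hence, the rejection region for the Bayes test is

\[
C = \{x : \Phi(\sqrt{\tau_1}\,(\theta_0 - \mu_1)) < 1/2\},
\]

where $\Phi(\cdot)$ is the c.d.f. of the standard normal variate. In particular, if $\theta_0 = 0$, $\tau = 1$, $r = 1$, $\mu = 0$ the rejection
region will be

\[
C = \left\{x : \Phi\left(-\frac{n\bar{x}}{\sqrt{n+1}}\right) < \frac{1}{2}\right\},
\]

which is similar to the classical uniformly most powerful critical region

\[
C = \{x : \bar{x} > 0\}.
\]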
Remark 7.24. The Bayes test depends on the prior distribution and the loss function whereas the
classical test procedure depends on the specified size $\alpha$ of the test.
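
The Bayes test of Example 7.26 under the '0-1' loss can be sketched in a few lines. The function name and the sample values of $\bar{x}$ below are hypothetical; precisions (not variances) are used throughout, as in the example.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bayes_test_normal_mean(xbar, n, r, mu, tau, theta0):
    """Return (mu1, tau1, P(theta > theta0 | x), reject_H0) for the 0-1 loss.

    r is the known sampling precision, tau the prior precision.
    """
    tau1 = tau + n * r                       # posterior precision
    mu1 = (tau * mu + n * r * xbar) / tau1   # posterior mean
    prob_h1 = 1.0 - std_normal_cdf(math.sqrt(tau1) * (theta0 - mu1))
    return mu1, tau1, prob_h1, prob_h1 > 0.5


# Special case of the text: theta0 = 0, tau = 1, r = 1, mu = 0.
# The sample means 0.3 and -0.3 are hypothetical illustrations.
result_pos = bayes_test_normal_mean(xbar=0.3, n=10, r=1.0, mu=0.0, tau=1.0, theta0=0.0)
result_neg = bayes_test_normal_mean(xbar=-0.3, n=10, r=1.0, mu=0.0, tau=1.0, theta0=0.0)
print(result_pos)
print(result_neg)
```

In this special case the decision depends only on the sign of $\bar{x}$, in line with the critical region $\{x : \bar{x} > 0\}$.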


Example 7.27. Suppose $X_1, X_2, \ldots, X_n$ are $n$ iid Bernoulli random variables with probability of success
$\theta$. Let the prior distribution of $\theta$ be $U(0, 1)$. We wish to test the hypothesis
$H_0: \theta \le 1/2$ versus $H_1: \theta > 1/2$ under the '0-1' loss function. Since the posterior distribution of $\theta$ is
Beta$(x + 1, n - x + 1)$, where $x = \sum_{i=1}^{n} x_i$, the rejection region for the Bayes test is given by

\[
P\left(\theta > \tfrac{1}{2} \,\middle|\, x\right) > \tfrac{1}{2},
\]

that is,

\[
\int_{1/2}^{1} \frac{\theta^{x} (1 - \theta)^{n-x}}{B(x+1, n-x+1)}\, d\theta > \frac{1}{2}.
\]

Suppose in five trials, 4 successes were observed. Then

\[
P\left(\theta > \tfrac{1}{2} \,\middle|\, x = 4\right) = \int_{1/2}^{1} \frac{\theta^{4} (1 - \theta)}{B(5, 2)}\, d\theta = 0.89.
\]

Since this is greater than 0.5, the Bayes test will reject $H_0$ under the '0-1' loss function.


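Bayes test under asymmetric loss function

Consider the loss function

\[
L(\theta, a_i) = \begin{cases} 0 & \text{if } \theta \in \Theta_i \\ k_i & \text{if } \theta \in \Theta_j; \quad j \ne i;\ i, j = 0, 1, \end{cases}
\]

such that $k_0$ and $k_1$ are positive numbers.
We wish to test the hypothesis $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_1$. If the posterior distribution of
$\theta$ is $g(\theta \mid x)$ then the posterior expected loss for taking action $a_0$ (accept $H_0$) is

\[
\rho_0 = \int_{\Theta} L(\theta, a_0)\, g(\theta \mid x)\, d\theta = k_0 \int_{\Theta_1} g(\theta \mid x)\, d\theta = k_0\, P(\theta \in \Theta_1 \mid x)
\]

and the posterior expected loss of taking action $a_1$ is

\[
\rho_1 = k_1\, P(\theta \in \Theta_0 \mid x).
\]

Therefore, the rejection region for the Bayes test is

\[
C = \left\{x : P(\theta \in \Theta_1 \mid x) > \frac{k_1}{k_0 + k_1}\right\}.
\]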
Example 7.28. A large shipment of parts is received, out of which 5 are tested for defects. The number
of defective parts, $X$, is assumed to have the Bin$(5, \theta)$ distribution. From past shipments, it is known that
$\theta$ has a Beta$(1, 9)$ prior distribution. Let $a_0$ denote the action 'decide $0 \le \theta \le 0.15$' and $a_1$ denote the
action 'decide $\theta > 0.15$'. If $X = 0$ is observed, find the Bayes action under the loss function

\[
L(\theta, a_0) = \begin{cases} 1 & \text{if } \theta > 0.15 \\ 0 & \text{if } \theta \le 0.15, \end{cases}
\]

and

\[
L(\theta, a_1) = \begin{cases} 2 & \text{if } \theta \le 0.15 \\ 0 & \text{if } \theta > 0.15. \end{cases}
\]
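Solution. The posterior distribution of $\theta$, given $x = 0$, is

\[
g(\theta \mid x = 0) = \frac{(1 - \theta)^{13}}{B(1, 14)}, \quad 0 < \theta < 1,
\]

which is the Beta$(1, 14)$ distribution. We shall reject $H_0$ if

\[
P(\theta > 0.15 \mid x = 0) > 2/3.
\]

Since

\[
P(\theta > 0.15 \mid x = 0) = \int_{0.15}^{1} \frac{(1 - \theta)^{13}}{B(1, 14)}\, d\theta = 0.10
\]

is not greater than 2/3, we decide that $\theta \in (0, 0.15)$.

However, if we change the loss function to

\[
L(\theta, a_0) = \begin{cases} 1 & \text{if } \theta > 0.15 \\ 0 & \text{if } \theta \le 0.15 \end{cases}
\]

and

\[
L(\theta, a_1) = \begin{cases} 0.15 - \theta & \text{if } \theta \le 0.15 \\ 0 & \text{if } \theta > 0.15 \end{cases}
\]

then the posterior expected loss for taking action $a_0$ is

\[
\rho_0 = \int_{\Theta} L(\theta, a_0)\, g(\theta \mid x)\, d\theta = \int_{0.15}^{1} \frac{(1 - \theta)^{13}}{B(1, 14)}\, d\theta = 0.103,
\]

and the posterior expected loss for taking action $a_1$ is

\[
\rho_1 = \int_{\Theta} L(\theta, a_1)\, g(\theta \mid x)\, d\theta = \int_{0}^{0.15} \frac{(0.15 - \theta)(1 - \theta)^{13}}{B(1, 14)}\, d\theta = 0.09.
\]

Since $\rho_0 > \rho_1$, we reject $H_0$ and decide $\theta > 0.15$.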
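
Both decisions in Example 7.28 can be verified with closed forms for the Beta$(1, b)$ distribution. In this sketch the function names are illustrative, and the closed form for the linear loss, $t - [1 - (1 - t)^{b+1}]/(b + 1)$, is obtained by integration by parts rather than taken from the text.

```python
def beta_1_b_tail(t: float, b: float) -> float:
    """P(theta > t) for theta ~ Beta(1, b), which equals (1 - t)^b."""
    return (1.0 - t) ** b


def expected_linear_loss(t: float, b: float) -> float:
    """E[(t - theta) * 1{theta <= t}] for theta ~ Beta(1, b)."""
    return t - (1.0 - (1.0 - t) ** (b + 1)) / (b + 1)


cutoff, b_post = 0.15, 14          # posterior is Beta(1, 14)

# Asymmetric constant losses k0 = 1 (wrongly accepting), k1 = 2 (wrongly rejecting).
k0, k1 = 1.0, 2.0
tail = beta_1_b_tail(cutoff, b_post)
reject_constant_loss = tail > k1 / (k0 + k1)

# Modified loss: L(theta, a1) = 0.15 - theta for theta <= 0.15.
rho0 = tail
rho1 = expected_linear_loss(cutoff, b_post)
reject_linear_loss = rho0 > rho1

print(f"P(theta > 0.15 | x = 0) = {tail:.3f}, reject H0: {reject_constant_loss}")
print(f"rho0 = {rho0:.3f}, rho1 = {rho1:.3f}, reject H0: {reject_linear_loss}")
```

The same posterior leads to opposite actions under the two loss functions, which is what the example sets out to show.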
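Example 7.29. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from $N(\theta, r)$ with known precision $r$. We wish
to test the hypothesis $H_0: \theta = \theta_0$ against the alternative $H_1: \theta \ne \theta_0$ under the loss function

\[
L(\theta, a_0) = c(\theta - \theta_0)^2, \quad -\infty < \theta < \infty,
\]

and

\[
L(\theta, a_1) = \begin{cases} 0 & \theta \ne \theta_0, \\ b & \theta = \theta_0, \end{cases}
\]

where $c > 0$, $b > 0$.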

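Following Jeffreys' approach, let us assume that the prior probability of $H_0$ is
$\pi_0 > 0$ and the remaining $(1 - \pi_0)$ is spread out according to the prior distribution $g_1(\theta)$ where $\theta \ne \theta_0$. The
posterior expected loss for taking action $a_0$ is

\[
\begin{aligned}
\rho_0 &= \int_{\Theta} L(\theta, a_0)\, g(\theta \mid x)\, d\theta \\
&= \int_{\theta = \theta_0} L(\theta, a_0)\, g_0(\theta \mid x)\, d\theta + \int_{\Theta - \{\theta_0\}} L(\theta, a_0)\, g_1(\theta \mid x)\, d\theta \\
&= c \int_{\theta = \theta_0} (\theta - \theta_0)^2\, g_0(\theta \mid x)\, d\theta + c \int_{\Theta - \{\theta_0\}} (\theta - \theta_0)^2\, g_1(\theta \mid x)\, d\theta \\
&= c(1 - \pi_0) \int_{\Theta - \{\theta_0\}} (\theta - \theta_0)^2\, g_1(\theta \mid x)\, d\theta,
\end{aligned}
\]

where

\[
g(\theta \mid x) = \begin{cases} \dfrac{\pi_0 f(x \mid \theta_0)}{m(x)} & \text{under } H_0 \\[2ex] \dfrac{(1 - \pi_0) f(x \mid \theta)\, g_1(\theta)}{m(x)} & \text{under } H_1, \end{cases}
\]

\[
m(x) = \pi_0 f(x \mid \theta_0) + (1 - \pi_0) \int_{\Theta - \{\theta_0\}} f(x \mid \theta)\, g_1(\theta)\, d\theta,
\]

and the posterior density under hypothesis $H_i$ is $g_i(\theta \mid x)$, $i = 0, 1$. Thus,

\[
\rho_0 = c(1 - \pi_0) \int_{\Theta - \{\theta_0\}} (\theta - \theta_0)^2\, g_1(\theta \mid x)\, d\theta = c(1 - \pi_0) \left[\frac{1}{\tau_1} + (\mu_1 - \theta_0)^2\right].
\]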


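Note that under the alternative hypothesis $H_1: \theta \in \Theta_1 = \Theta - \{\theta_0\}$, the prior density is $g_1(\theta)$ which is
assumed to be $N(\mu, \tau)$ and the likelihood function is

\[
L(\theta \mid x) \propto \exp\left\{-\frac{nr}{2}(\theta - \bar{x})^2\right\}, \quad \theta \ne \theta_0.
\]

Since a singleton point $\{\theta_0\}$ taken away from the parameter space $\Theta$ does not affect the value
of the integral, the posterior density under $H_1$ will be $N(\mu_1, \tau_1)$, where $\mu_1 = (\tau\mu + nr\bar{x})/\tau_1$ and
$\tau_1 = \tau + nr$. The posterior expected loss for taking action $a_1$ is

\[
\begin{aligned}
\rho_1 &= \int_{\Theta} L(\theta, a_1)\, g(\theta \mid x)\, d\theta \\
&= \int_{\theta = \theta_0} L(\theta, a_1)\, g(\theta \mid x)\, d\theta + \int_{\Theta_1} L(\theta, a_1)\, g(\theta \mid x)\, d\theta \\
&= b \int_{\theta = \theta_0} g(\theta \mid x)\, d\theta = b\,p_0,
\end{aligned}
\]

where $p_0$ is the posterior probability of $H_0$ being true. Thus, the Bayes decision is the one for which
the posterior expected loss is smaller.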
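Example 7.30. Let us consider the situation in which the normal precision $r$ is also unknown and we
wish to test the hypothesis $H_0: \theta = \theta_0$ against the alternative hypothesis $H_1: \theta \ne \theta_0$. Consider the
loss function

\[
L(\theta, r; a_0) = c\,r(\theta - \theta_0)^2, \quad -\infty < \theta < \infty,
\]

and

\[
L(\theta, r; a_1) = \begin{cases} 0 & \theta \ne \theta_0, \\ b & \theta = \theta_0, \end{cases}
\]

where $c > 0$ and $b > 0$.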




Note that the parameter space $\Omega = \Theta \times R$ is the set of all points $(\theta, r)$ such that $\theta \in (-\infty, \infty)$ and
$r \in (0, \infty)$. The parameter space $\Omega_0$ under $H_0$ is the line containing the points such that $\theta = \theta_0$ in the
upper half plane. The complement subset $\Omega_1$ of the parameter space, specified by the hypothesis $H_1$,
contains the remaining points of the upper half plane. The prior probability of $H_0$ would be zero under
any joint pdf of $\theta$ and $r$ defined over the whole parameter space $\Omega$. We assume that this probability
is $\pi_0\ (> 0)$. The distribution of this probability over $\Omega_0$ may be described in terms of a conditional pdf
$g_0(r \mid \theta = \theta_0)$, defined over the line, which we may take as Gamma$(\alpha, \beta)$. The remaining probability
$(1 - \pi_0)$ is distributed over $\Omega_1$ according to $g_1(\theta, r)$ which is the Normal-Gamma$(\mu, \tau, \alpha, \beta)$ distribution, given
by

\[
g_1(\theta, r) = g(\theta \mid r)\, g(r) = \left(\frac{\tau r}{2\pi}\right)^{1/2} \exp\left\{-\frac{\tau r}{2}(\theta - \mu)^2\right\} \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, r^{\alpha - 1} e^{-\beta r}.
\]


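The joint posterior distribution of $\theta$ and $r$, given $x$, is

\[
g(\theta, r \mid x) = \begin{cases} \dfrac{\pi_0 f(x \mid \theta_0, r)\, g_0(r \mid \theta = \theta_0)}{m(x)} & \text{under } H_0 \\[2ex] \dfrac{(1 - \pi_0) f(x \mid \theta, r)\, g_1(\theta, r)}{m(x)} & \text{under } H_1. \end{cases}
\]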


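The posterior expected loss for action $a_0$ is

\[
\begin{aligned}
\rho_0 &= \int_0^{\infty} \int_{-\infty}^{\infty} L(\theta, r; a_0)\, g(\theta, r \mid x)\, d\theta\, dr \\
&= \int_0^{\infty} \left\{ \int_{\theta = \theta_0} L(\theta, r; a_0)\, g_0(\theta, r \mid x)\, d\theta + \int_{\theta \ne \theta_0} L(\theta, r; a_0)\, g_1(\theta, r \mid x)\, d\theta \right\} dr \\
&= \int_0^{\infty} \int_{\theta \ne \theta_0} c\,r(\theta - \theta_0)^2\, g_1(\theta, r \mid x)\, d\theta\, dr \\
&= (1 - \pi_0)\, c\, E\left[ r(\theta - \theta_0)^2 \mid \theta \ne \theta_0 \right],
\end{aligned}
\]

where the expectation is taken with respect to the joint posterior density, $g_1(\theta, r \mid x)$, of $\theta$ and $r$
($\theta \ne \theta_0$ and $r > 0$). Hence

\[
\begin{aligned}
\rho_0 &= (1 - \pi_0)\, c\, E\left[ r\, E\{(\theta - \theta_0)^2 \mid r, \theta \ne \theta_0\} \right] \\
&= (1 - \pi_0)\, c\, E\left[ r \left\{ \frac{1}{(\tau + n) r} + (\mu_1 - \theta_0)^2 \right\} \,\middle|\, \theta \ne \theta_0 \right].
\end{aligned}
\]

Since the conditional posterior density of $\theta$, given $r$, under $H_1$ is $N(\mu_1, (\tau + n) r)$ with
$\mu_1 = (\tau\mu + n\bar{x})/(\tau + n)$, we have

\[
\rho_0 = (1 - \pi_0)\, c \left\{ \frac{1}{\tau + n} + (\mu_1 - \theta_0)^2\, E(r \mid \theta \ne \theta_0) \right\},
\]

where $E(r \mid \theta \ne \theta_0)$ is taken with respect to the marginal posterior density of $r$, which is Gamma with
parameters

\[
\alpha_1 = \alpha + \frac{n}{2} \quad \text{and} \quad \beta_1 = \beta + \frac{1}{2} \sum_{i=1}^{n} (x_i - \bar{x})^2 + \frac{\tau n (\bar{x} - \mu)^2}{2(\tau + n)}.
\]

Hence,

\[
\rho_0 = (1 - \pi_0)\, c \left\{ \frac{1}{\tau + n} + (\mu_1 - \theta_0)^2\, \frac{\alpha_1}{\beta_1} \right\},
\]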


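and the posterior expected loss for action $a_1$ is

\[
\begin{aligned}
\rho_1 &= \int_0^{\infty} \int_{-\infty}^{\infty} L(\theta, r; a_1)\, g(\theta, r \mid x)\, d\theta\, dr \\
&= b \int_0^{\infty} \int_{\theta = \theta_0} g_0(\theta, r \mid x)\, d\theta\, dr \\
&= \frac{b\,\pi_0}{m(x)} \int_0^{\infty} f(x \mid \theta_0, r)\, g_0(r \mid \theta = \theta_0)\, dr \\
&= b\, \frac{\pi_0 f(x \mid \theta_0)}{m(x)},
\end{aligned}
\]

where

\[
m(x) = \pi_0 f(x \mid \theta_0) + (1 - \pi_0) \int_0^{\infty} \int_{\theta \ne \theta_0} f(x \mid \theta, r)\, g_1(\theta, r)\, d\theta\, dr
\]

and

\[
\begin{aligned}
\int_0^{\infty} \int_{\theta \ne \theta_0} f(x \mid \theta, r)\, g_1(\theta, r)\, d\theta\, dr
&= \left(\frac{\tau}{\tau + n}\right)^{1/2} \left(\frac{1}{2\pi}\right)^{n/2} \frac{\beta^{\alpha}}{\Gamma(\alpha)} \int_0^{\infty} \int_{-\infty}^{\infty} \left(\frac{(\tau + n) r}{2\pi}\right)^{1/2} r^{\alpha_1 - 1} \exp\left\{-\beta_1 r - \frac{(\tau + n) r}{2}(\theta - \mu_1)^2\right\} d\theta\, dr \\
&= \left(\frac{\tau}{\tau + n}\right)^{1/2} \left(\frac{1}{2\pi}\right)^{n/2} \frac{\beta^{\alpha}\, \Gamma(\alpha_1)}{\Gamma(\alpha)\, \beta_1^{\alpha_1}}.
\end{aligned}
\]

In particular, if our observed sample of size 10 is such that $\bar{x} = 1$, $\sum_{i=1}^{10} x_i^2 = 15$, the prior
hyperparameters are $\alpha = \beta = 1$, $\mu = 0$, $\tau = 10$, $\pi_0 = 1/2$ and we wish to test $H_0: \theta = 0$ against the
alternatives $H_1: \theta \ne 0$, then $\rho_0 = 0.150c$ and $\rho_1 = 0.149b$. Therefore, we shall reject $H_0$ if $\rho_0 > \rho_1$, that is,
if $c/b > 0.993$.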
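
The two numerical values at the end of Example 7.30 can be recomputed from the formulas of the example. The sketch below makes two assumptions that should be read with care: the garbled data summary is taken to be $\sum x_i^2 = 15$ (this reading reproduces the quoted figures), and $\rho_0$ is evaluated with the example's own expression $(1 - \pi_0)\,c\,\{1/(\tau + n) + (\mu_1 - \theta_0)^2 \alpha_1/\beta_1\}$. Function and variable names are illustrative; the common factor $(2\pi)^{-n/2}$ is dropped from both marginals because it cancels.

```python
import math


def log_marginal_h0(n, sum_sq_about_theta0, alpha, beta):
    """log of the H0 marginal, integrating r ~ Gamma(alpha, beta), up to (2 pi)^(-n/2)."""
    a1 = alpha + n / 2.0
    return (alpha * math.log(beta) - math.lgamma(alpha)
            + math.lgamma(a1) - a1 * math.log(beta + sum_sq_about_theta0 / 2.0))


def posterior_summaries_h1(n, xbar, sum_sq_about_mean, mu, tau, alpha, beta):
    """Posterior quantities under H1 with a Normal-Gamma(mu, tau, alpha, beta) prior."""
    mu1 = (tau * mu + n * xbar) / (tau + n)
    a1 = alpha + n / 2.0
    b1 = beta + sum_sq_about_mean / 2.0 + tau * n * (xbar - mu) ** 2 / (2.0 * (tau + n))
    log_m1 = (0.5 * math.log(tau / (tau + n)) + alpha * math.log(beta)
              - math.lgamma(alpha) + math.lgamma(a1) - a1 * math.log(b1))
    return mu1, a1, b1, log_m1


# Data and hyperparameters of the example; sum of squares read as 15 (assumption).
n, xbar, sum_x_sq = 10, 1.0, 15.0
alpha, beta, mu, tau, pi0, theta0 = 1.0, 1.0, 0.0, 10.0, 0.5, 0.0

ss_about_mean = sum_x_sq - n * xbar**2
ss_about_theta0 = sum_x_sq - 2.0 * theta0 * n * xbar + n * theta0**2

m0 = math.exp(log_marginal_h0(n, ss_about_theta0, alpha, beta))
mu1, a1, b1, log_m1 = posterior_summaries_h1(n, xbar, ss_about_mean, mu, tau, alpha, beta)
m1 = math.exp(log_m1)

rho1_over_b = pi0 * m0 / (pi0 * m0 + (1.0 - pi0) * m1)      # posterior P(H0 | x)
rho0_over_c = (1.0 - pi0) * (1.0 / (tau + n) + (mu1 - theta0) ** 2 * a1 / b1)

print(f"rho0 = {rho0_over_c:.3f} c, rho1 = {rho1_over_b:.3f} b")
print(f"reject H0 when c/b > {rho1_over_b / rho0_over_c:.3f}")
```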


Chapter 8 


Predictive Inference 


8.1 INTRODUCTION 


The purpose of statistical analysis is not only to understand and interpret the observed data but also
to make inferential statements about unknown parameters of the model so as to inform the client about
what is likely to happen if the experiment is performed again. In fact, the client may be more interested
in directly useful inference which tells him about what is likely to happen when future experiments are
performed.

Any inferential problem whose solution depends on some future occurrence is a problem of 
statistical prediction. It is a practical problem which involves uncertainty. The predictive inference will, 
therefore, require probability calculus and statistical tools to formulate and solve any prediction problem. 


A classical statistician solves the problem of prediction by 'plugging-in' the estimate $\hat{\theta}$ of the
parameter $\theta$ in the distribution $f(x \mid \theta)$ as if it were the true value of the parameter and uses the estimated
distribution $f(x \mid \hat{\theta})$ to make inferential statements about outcomes of the future experiment.


Aitchison and Dunsmore (1975) and Geisser (1993) advocated that inference about unobservable 
model parameters has no direct relevance to decisions since they are simply theoretical quantities. They 
argue that if the loss function represents a loss that the decision maker can actually incur at some point 
in the future it must depend on quantities whose values are known at that time. Furthermore, inferring 
about observables is more relevant since they can occur and be evaluated to a degree that is not 
possible with parameters. 

Suppose that $x$ represents a set of available data and $y$ is an independent set of potential future
data. The problem of prediction amounts to obtaining an expression for $g(y \mid x)$, the probability
distribution of $y$ conditional on $x$. Defining $g(y \mid x) = f(x, y)/f(x)$ without reference to the parameter is not
practical: $x$ and $y$ cannot be independent, since $x$ has to provide information
about $y$, and it is very difficult to specify the joint distribution of dependent variables unless parameters
are introduced.

The conventional Bayesian approach to predictive inference assumes a parametric model $f(x \mid \theta)$
for observables and a prior distribution for the unknown parameters. All the parameters of the model $f(x \mid \theta)$
are considered as nuisance parameters and the (posterior) predictive distribution $g(y \mid x)$ can be derived
by integrating out $\theta$ from their joint posterior distribution. Thus,

\[
g(y \mid x) = \int_{\Theta} g(y, \theta \mid x)\, d\theta = \int_{\Theta} f(y \mid \theta, x)\, g(\theta \mid x)\, d\theta. \tag{8.1}
\]

If we consider the data $x$ and the future observation $y$ independent of each other, for a given value
of $\theta$, then

\[
f(y \mid \theta, x) = f(y \mid \theta)
\]

and, therefore,

\[
g(y \mid x) = \int_{\Theta} f(y \mid \theta)\, g(\theta \mid x)\, d\theta. \tag{8.2}
\]
° 
The predictive distribution $g(y \mid x)$ summarizes the information concerning the likely value of a
future potential observation given the likelihood, the prior, and the data we have observed so far. In
case no historical data $x$ is available, we may write

\[
g(y) = \int_{\Theta} f(y, \theta)\, d\theta = \int_{\Theta} f(y \mid \theta)\, g(\theta)\, d\theta, \tag{8.3}
\]

where $g(\theta)$ is a prior density of $\theta$. The marginal density $g(y)$ is known as the prior predictive density. It
is prior because it is not conditional on previous observations and predictive since it is a density for
an observable quantity. If $g(\theta)$ is a parametric density having hyperparameter(s), say $\alpha$, then $g(y)$ will
have parameter(s) $\alpha$.

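In general, if the data $x$ consists of $n$ independent observations $x_1, x_2, \ldots, x_n$ from the common
pdf $f(x \mid \theta)$ and $x_{n+1}, \ldots, x_{n+m}$, for $m \ge 1$, $n \ge 1$, are the $m$ future iid observations from the same population,
then

\[
g(x_{n+1}, \ldots, x_{n+m} \mid x_1, x_2, \ldots, x_n) = \frac{\int_{\Theta} \prod_{i=1}^{m+n} f(x_i \mid \theta)\, g(\theta)\, d\theta}{\int_{\Theta} \prod_{i=1}^{n} f(x_i \mid \theta)\, g(\theta)\, d\theta} = \int_{\Theta} \prod_{i=n+1}^{m+n} f(x_i \mid \theta)\, g(\theta \mid x_1, \ldots, x_n)\, d\theta. \tag{8.4}
\]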


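Result 8.1. If $X_{n+1}$ and $X = (X_1, \ldots, X_n)$ are independent observations from $f(\cdot \mid \theta)$, with mean $\theta$ and
variance $\sigma^2$, then

\[
E(X_{n+1} \mid x) = E[E(X_{n+1} \mid \theta, x) \mid x] = E[E(X_{n+1} \mid \theta) \mid x] = E(\theta \mid x),
\]

and

\[
\begin{aligned}
\mathrm{Var}(X_{n+1} \mid x) &= E[\mathrm{Var}(X_{n+1} \mid \theta, x) \mid x] + \mathrm{Var}[E(X_{n+1} \mid \theta, x) \mid x] \\
&= E[\mathrm{Var}(X_{n+1} \mid \theta) \mid x] + \mathrm{Var}[E(X_{n+1} \mid \theta) \mid x] \\
&= E(\sigma^2 \mid x) + \mathrm{Var}(\theta \mid x).
\end{aligned} \tag{8.5}
\]

Remark 8.1. In particular,

\[
\begin{aligned}
g(x_{n+1} \mid x) &= \int f(x_{n+1}, \theta \mid x)\, d\theta \\
&= \int f(x_{n+1} \mid \theta, x)\, g(\theta \mid x)\, d\theta \\
&= \int f(x_{n+1} \mid \theta)\, g(\theta \mid x)\, d\theta = E[f(x_{n+1} \mid \theta)],
\end{aligned} \tag{8.6}
\]

where the expectation is taken with respect to the posterior density $g(\theta \mid x)$.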
Remark 8.2. If parameters are regarded as neither meaningful nor necessary then the predictive
distribution of $y$, given $x$, is

\[
g(y \mid x) = \frac{f(x, y)}{f(x)}, \quad \text{where } f(x) = \int_{Y} f(x, y)\, dy,
\]

without reference to a parameter $\theta$. However, if the data $x$ is to provide information about the future
observation $y$, one cannot treat them as independent. Any reliable subjective evaluation of joint
distributions of non-independent variables is extremely difficult. In other words, it is not practically
feasible to write down $f(x, y)$.

Introduction of parameters in a parametric model is a natural way of representing the joint
distribution of the data.

Remark 8.3. The posterior distribution of $\theta$, given $x_1, x_2, \ldots, x_n$, provides a summary of the experiment,
whereas the predictive distribution $g(x_{n+1} \mid x_1, \ldots, x_n)$ is specific to the $(n+1)$th observation.

Let us consider a simple coin tossing experiment in which we have observed number of heads 
in n tosses of a coin. Our interest is in finding the probability of getting a head in the next toss of 
the coin or number of heads in the future m tosses of the coin. 

To illustrate the difference between the classical parametric approach and the Bayesian approach, let
us consider an experiment in which three heads were observed in five tosses of a coin. The classical
statistician may use the maximum likelihood estimate of the probability of getting a head on a toss and then
use it as if it were the true value of the probability of getting a head. Thus for the next toss he will
consider the underlying distribution to be Bernoulli with probability of success 3/5 = 0.6. He will,
therefore, say that the probability of getting a head in the next toss of the coin is 0.6. However, a
Bayesian will assume a prior distribution for the probability $\theta$ of getting a head as a beta distribution
with the hyperparameters $(\alpha, \beta)$. In case he is not sure about the nature of the coin, he may very well
assume a non-informative prior distribution for $\theta$. It may be recalled that most of the non-informative
prior distributions for $\theta$ may be represented by a beta distribution. For example, the Bayes-Laplace
uniform prior is a beta distribution with $\alpha = \beta = 1$, Jeffreys' non-informative prior is a beta distribution
with $\alpha = \beta = 1/2$, and Haldane's nil-prior is a limit of the beta distribution as $\alpha \to 0$ and $\beta \to 0$ (which
happens to be an improper prior).


To fix ideas, let us take the prior distribution of $\theta$ as $U(0, 1)$. The predictive distribution
of the sixth toss, given the outcome of the first five tosses, is

\[
g(x_6 \mid x_1, x_2, \ldots, x_5) = \int_0^1 \theta^{x_6} (1 - \theta)^{1 - x_6}\, \frac{\theta^{3} (1 - \theta)^{2}}{B(4, 3)}\, d\theta = \frac{B(x_6 + 4, 4 - x_6)}{B(4, 3)}, \quad x_6 = 0, 1.
\]

Note that we are denoting $x_i = 1$ if a head appears and $x_i = 0$ if a tail appears on the $i$th toss ($i = 1, \ldots, 6$).
Then

\[
P\left(X_6 = 1 \,\middle|\, \sum_{i=1}^{5} x_i = 3\right) = \frac{B(5, 3)}{B(4, 3)} = \frac{4}{7} = 0.57.
\]

However, if we had observed the tosses of this coin in the past and our feeling is that the most
probable value of $\theta$ is 1/2 and we are ready to bet a reasonably large amount of money on this, we may
formulate the prior distribution to be a symmetric beta with $\alpha = \beta$ such that its variance is very small.
For example, we may take $\alpha = 100$ so that its variance is 0.0012. The predictive distribution of $x_6$ is then

\[
g\left(x_6 \,\middle|\, \sum_{i=1}^{5} x_i = 3\right) = \frac{B(x_6 + 103, 103 - x_6)}{B(103, 102)}, \quad x_6 = 0, 1,
\]

and, therefore, the probability of getting a head in the sixth toss is

\[
P\left(X_6 = 1 \,\middle|\, \sum_{i=1}^{5} x_i = 3\right) = \frac{B(104, 102)}{B(103, 102)} = 0.502.
\]

This suggests that the prior information about the nature of the coin should be considered in
order to obtain a realistic value of the probability of getting a head in the next toss of the coin.
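
The three probabilities discussed for the sixth toss (plug-in, uniform prior, concentrated symmetric prior) can be compared directly. This is an illustrative sketch: the function names are not from the text, and it uses the identity $B(a + 1, b)/B(a, b) = a/(a + b)$ to avoid evaluating beta functions.

```python
from fractions import Fraction


def plug_in_prob(heads: int, n: int) -> Fraction:
    """Classical plug-in probability of a head: the maximum likelihood estimate."""
    return Fraction(heads, n)


def beta_predictive_prob(heads: int, n: int, a, b) -> Fraction:
    """P(next toss is a head | data) under a Beta(a, b) prior."""
    a, b = Fraction(a), Fraction(b)
    return (a + heads) / (a + b + n)


heads, tosses = 3, 5
p_plug_in = plug_in_prob(heads, tosses)
p_uniform = beta_predictive_prob(heads, tosses, 1, 1)
p_jeffreys = beta_predictive_prob(heads, tosses, Fraction(1, 2), Fraction(1, 2))
p_strong = beta_predictive_prob(heads, tosses, 100, 100)

for label, p in [("plug-in", p_plug_in), ("uniform prior", p_uniform),
                 ("Jeffreys prior", p_jeffreys), ("Beta(100, 100) prior", p_strong)]:
    print(f"{label:22s} {p} = {float(p):.3f}")
```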
Remark 8.4. It is important to realize that the procedure adopted by classical statisticians, in which one
obtains an estimate of the indexing parameter for some class of distributions describing the experiment and
subsequently uses the estimate as if it were the true value, is quite naive. The discussion in Remark
8.3 suggests that such a “plug in” approach may be incoherent from the Bayesian viewpoint. It is
unfortunate that statisticians have criticized this approach in simple practical situations but have not
followed the Bayesian predictive approach to solve complicated problems.

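In the above example, if our prior information is that the coin is a fair one without any doubt
then our prior information about the parameter $\theta$ is $P(\theta = 1/2) = 1$. The posterior distribution based on
the data $x = (x_1, \ldots, x_n)$ obtained from $n$ tosses of the coin will be

\[
\begin{aligned}
g(\theta \mid x) &= \begin{cases} P(\theta = 1/2)\, \ell(\theta \mid x)/m(x) & \text{if } \theta = 1/2 \\ P(\theta \ne 1/2)\, \ell(\theta \mid x)/m(x) & \text{if } \theta \ne 1/2 \end{cases} \\
&= \begin{cases} 1 \cdot \left(\tfrac{1}{2}\right)^{\sum x_i} \left(\tfrac{1}{2}\right)^{n - \sum x_i} \Big/ \left(\tfrac{1}{2}\right)^{n} & \text{if } \theta = 1/2 \\ 0 \Big/ \left(\tfrac{1}{2}\right)^{n} & \text{if } \theta \ne 1/2 \end{cases} \\
&= \begin{cases} 1 & \text{if } \theta = 1/2 \\ 0 & \text{if } \theta \ne 1/2, \end{cases}
\end{aligned}
\]

since the constant of proportionality $m(x)$ is

\[
m(x) = 1 \cdot \ell(\theta = 1/2 \mid x) + 0 \cdot \int_{\{\theta : \theta \ne 1/2\}} \ell(\theta \mid x)\, d\theta = \left(\tfrac{1}{2}\right)^{\sum x_i} \left(\tfrac{1}{2}\right)^{n - \sum x_i} = \left(\tfrac{1}{2}\right)^{n}.
\]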

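The predictive density function of the future $(n+1)$th observation is

\[
\begin{aligned}
g(x_{n+1} \mid x) &= \int_0^1 f(x_{n+1} \mid \theta)\, g(\theta \mid x)\, d\theta \\
&= \left(\tfrac{1}{2}\right)^{x_{n+1}} \left(1 - \tfrac{1}{2}\right)^{1 - x_{n+1}} g(\theta = 1/2 \mid x) + \int_{\{\theta : \theta \ne 1/2\}} \theta^{x_{n+1}} (1 - \theta)^{1 - x_{n+1}}\, g(\theta \mid x)\, d\theta \\
&= \left(\tfrac{1}{2}\right)^{x_{n+1}} \left(\tfrac{1}{2}\right)^{1 - x_{n+1}} \cdot 1 + 0.
\end{aligned}
\]

Thus,

\[
g(x_{n+1} \mid x) = \begin{cases} \tfrac{1}{2} & \text{if } x_{n+1} = 1 \\ \tfrac{1}{2} & \text{if } x_{n+1} = 0. \end{cases}
\]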


Remark 8.5. This illustrates that if we are a priori sure of the coin being fair then, irrespective of the
outcomes of the past tosses of the coin, neither the posterior probability of $\theta$ nor the predictive
probability of getting a head in the next trial will change.

In general, if the prior distribution of $\theta$ is a degenerate distribution, i.e., $P(\theta = \theta_0) = 1$, then the
posterior distribution of $\theta$ is

\[
g(\theta \mid x) = \begin{cases} 1 & \text{if } \theta = \theta_0 \\ 0 & \text{if } \theta \ne \theta_0, \end{cases} \tag{8.7}
\]

and the predictive density function of the future $(n+1)$th observation is

\[
g(x_{n+1} \mid x) = \begin{cases} \theta_0 & \text{if } x_{n+1} = 1 \\ 1 - \theta_0 & \text{if } x_{n+1} = 0, \end{cases} \tag{8.8}
\]

which shows that irrespective of the number of observations or their values, neither the posterior nor
the predictive beliefs will change.


8.2 STANDARD PREDICTIVE DISTRIBUTIONS


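Example 8.1. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from $N(\theta, \sigma^2)$, $\sigma^2$ known. If the prior
distribution of $\theta$ is $g(\theta) \propto 1$ then the posterior distribution of $\theta$, given $X = x$, is $N(\bar{x}, \sigma^2/n)$. The
predictive distribution of the future independent observation $X_{n+1}$ from $N(\theta, \sigma^2)$, $\sigma^2$ known, is

\[
\begin{aligned}
g(x_{n+1} \mid x) &= \int_{-\infty}^{\infty} f(x_{n+1} \mid \theta)\, g(\theta \mid x)\, d\theta \\
&\propto \int_{-\infty}^{\infty} \exp\left\{-\frac{1}{2\sigma^2}\left[(x_{n+1} - \theta)^2 + n(\theta - \bar{x})^2\right]\right\} d\theta \\
&\propto \exp\left\{-\frac{n (x_{n+1} - \bar{x})^2}{2(n+1)\sigma^2}\right\},
\end{aligned}
\]

which is $N(\bar{x}, \sigma^2 (n+1)/n)$.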
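
The predictive variance $\sigma^2 (n+1)/n$ of Example 8.1 can be checked by simulation: draw $\theta$ from the posterior $N(\bar{x}, \sigma^2/n)$ and then a future observation from $N(\theta, \sigma^2)$. The following is a rough Monte Carlo sketch; the values of $\bar{x}$, $\sigma$, $n$, the number of draws and the seed are arbitrary assumptions.

```python
import random


def simulate_predictive(xbar, sigma, n, draws, seed=0):
    """Draw from the posterior predictive by composition: theta, then x_{n+1}."""
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        theta = rng.gauss(xbar, sigma / n**0.5)   # theta | x ~ N(xbar, sigma^2 / n)
        samples.append(rng.gauss(theta, sigma))    # x_{n+1} | theta ~ N(theta, sigma^2)
    return samples


# Hypothetical settings chosen only for illustration.
xbar, sigma, n = 2.0, 1.5, 4
samples = simulate_predictive(xbar, sigma, n, draws=100_000)

emp_mean = sum(samples) / len(samples)
emp_var = sum((s - emp_mean) ** 2 for s in samples) / len(samples)
theory_var = sigma**2 * (n + 1) / n

print(f"empirical mean {emp_mean:.3f} vs {xbar}")
print(f"empirical variance {emp_var:.3f} vs theoretical {theory_var:.4f}")
```

A plug-in predictive $N(\bar{x}, \sigma^2)$ would understate this variance by the factor $n/(n+1)$, since it ignores the uncertainty about $\theta$.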


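Example 8.2. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from $N(\theta, r)$, $\theta$ and precision $r$ both being
unknown. If the joint prior distribution of $(\theta, r)$ is $g(\theta, r) \propto 1/r$, then the joint posterior distribution of
$(\theta, r)$, given $X = x$, is

\[
g(\theta, r \mid x) \propto r^{\frac{n}{2} - 1} \exp\left\{-\frac{r}{2}\left[n(\theta - \bar{x})^2 + \sum_{i=1}^{n} (x_i - \bar{x})^2\right]\right\}.
\]

Hence, the predictive distribution of $X_{n+1}$, given $X = x$, is

\[
\begin{aligned}
g(x_{n+1} \mid x) &= \int_0^{\infty} \int_{-\infty}^{\infty} g(\theta, r \mid x)\, f(x_{n+1} \mid \theta, r)\, d\theta\, dr \\
&\propto \int_0^{\infty} \int_{-\infty}^{\infty} r^{\frac{n+1}{2} - 1} \exp\left\{-\frac{r}{2}\left[n(\theta - \bar{x})^2 + \sum_{i=1}^{n} (x_i - \bar{x})^2 + (x_{n+1} - \theta)^2\right]\right\} d\theta\, dr.
\end{aligned}
\]

On using the identity

\[
n(\bar{x} - \theta)^2 + (x_{n+1} - \theta)^2 = (n+1)(\theta - \theta^{*})^2 + \frac{n (x_{n+1} - \bar{x})^2}{n+1},
\]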


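we have

\[
\begin{aligned}
g(x_{n+1} \mid x) &\propto \int_0^{\infty} r^{\frac{n+1}{2} - 1} \exp\left\{-\frac{r}{2}\left[\sum_{i=1}^{n} (x_i - \bar{x})^2 + \frac{n}{n+1}(x_{n+1} - \bar{x})^2\right]\right\} \left( \int_{-\infty}^{\infty} \exp\left\{-\frac{r}{2}(n+1)(\theta - \theta^{*})^2\right\} d\theta \right) dr \\
&\propto \left[\sum_{i=1}^{n} (x_i - \bar{x})^2 + \frac{n}{n+1}(x_{n+1} - \bar{x})^2\right]^{-n/2} \\
&\propto \left[1 + \frac{n}{n+1}\, \frac{1}{n-1}\, \frac{(x_{n+1} - \bar{x})^2}{s^2}\right]^{-\frac{(n-1)+1}{2}},
\end{aligned}
\]

where $(n-1)s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2$ and $\theta^{*} = \dfrac{n\bar{x} + x_{n+1}}{n+1}$.

Thus, $g(x_{n+1} \mid x)$ is a 3-parameter t-density with $(n-1)$ df, location parameter $\bar{x}$, and scale parameter
$n/((n+1)s^2)$. Its mean is $\bar{x}$ and variance is $\dfrac{(n-1)(n+1)s^2}{n(n-3)}$.

Example 8.3. Let $X_1, \ldots, X_{n+m}$ be independent random variables having a common pdf
$f(x \mid \theta) = \theta e^{-\theta x}$; $\theta > 0$, $x > 0$. Assume $g(\theta) \propto \theta^{\alpha - 1} e^{-\beta\theta}$, $\alpha > 0$, $\beta > 0$, where $\alpha$ and $\beta$ are the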


hyperparameters. Then $g(\theta \mid x_1, \ldots, x_n)$ is the Gamma$(\alpha + n, \beta + n\bar{x})$ distribution and the posterior
predictive distribution of the future $m$ independent observations $(X_{n+1}, \ldots, X_{n+m})$ is

\[
\begin{aligned}
g(x_{n+1}, \ldots, x_{n+m} \mid x_1, \ldots, x_n) &= \int_0^{\infty} f(x_{n+1}, \ldots, x_{n+m} \mid \theta)\, g(\theta \mid x_1, \ldots, x_n)\, d\theta \\
&= \frac{\Gamma(n + m + \alpha)\, (\beta + n\bar{x})^{n + \alpha}}{\Gamma(n + \alpha)\, (\beta + n\bar{x} + x_{n+1} + x_{n+2} + \cdots + x_{n+m})^{\alpha + m + n}},
\end{aligned} \tag{8.9}
\]

and the prior predictive density of $X_{n+1}, X_{n+2}, \ldots, X_{n+m}$, given $(\alpha, \beta)$, is

\[
g(x_{n+1}, \ldots, x_{n+m} \mid \alpha, \beta) = \frac{\beta^{\alpha}\, \Gamma(\alpha + m)}{\Gamma(\alpha)\, (\beta + x_{n+1} + \cdots + x_{n+m})^{\alpha + m}}. \tag{8.10}
\]

In particular, for $n = 0$, that is, if no past data are available, the posterior predictive density is exactly
the same as the prior predictive density of $(X_{n+1}, \ldots, X_{n+m})$.
For a non-informative prior, i.e., $\alpha = \beta = 0$, the prior predictive density $g(x_{n+1}, \ldots, x_{n+m} \mid \alpha, \beta)$
does not exist. However, the corresponding posterior predictive density function does exist.
Also, one may observe that for $m = 1$, the posterior predictive distribution is a 3-parameter
Inverted-Beta$\left(1, \alpha + n, \beta + \sum_{i=1}^{n} x_i\right)$.
i=l 
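Example 8.4. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from the Pareto$(k, \theta)$, $k$ known, having
density function

\[
f(x \mid \theta) = \frac{\theta k^{\theta}}{x^{\theta + 1}}, \quad x > k,\ \theta > 0.
\]

The conjugate prior for $\theta$ is Gamma$(\alpha, \beta)$ and the posterior density of $\theta$ is
Gamma$\left(\alpha + n, \beta + \sum_{i=1}^{n} \log(x_i/k)\right)$. Hence, the predictive density function of $X_{n+1}$, given $X = x$, is

\[
g(x_{n+1} \mid X = x) = \frac{\left[\beta + \sum_{i=1}^{n} \log(x_i/k)\right]^{\alpha + n}}{B(1, \alpha + n)\, x_{n+1} \left[\beta + \sum_{i=1}^{n} \log(x_i/k) + \log(x_{n+1}/k)\right]^{\alpha + n + 1}}. \tag{8.11}
\]

Thus, the probability that the next observation will exceed $k e^{d}$ is

\[
P(X_{n+1} > k e^{d} \mid x) = \left[\frac{\beta + \sum_{i=1}^{n} \log(x_i/k)}{\beta + \sum_{i=1}^{n} \log(x_i/k) + d}\right]^{\alpha + n}.
\]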
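
The exceedance probability at the end of Example 8.4 is simple to evaluate. A minimal sketch follows; the function name, the data values and the hyperparameters are hypothetical and serve only to show how the formula is used.

```python
import math


def pareto_exceedance_prob(xs, k, alpha, beta, d):
    """P(X_{n+1} > k * exp(d) | x) for Pareto(k, theta) data and a Gamma(alpha, beta) prior."""
    n = len(xs)
    rate = beta + sum(math.log(x / k) for x in xs)   # posterior Gamma rate parameter
    return (rate / (rate + d)) ** (alpha + n)


# Hypothetical data and hyperparameters (not from the text).
observations = [2.0, 4.0]
prob_double = pareto_exceedance_prob(observations, k=1.0, alpha=1.0, beta=1.0, d=math.log(2.0))
prob_zero = pareto_exceedance_prob(observations, k=1.0, alpha=1.0, beta=1.0, d=0.0)

print(f"P(X_next > 2k | x) = {prob_double:.4f}")
print(f"P(X_next > k  | x) = {prob_zero:.4f}")
```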


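Example 8.5. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from the Pareto density
$f(x \mid \theta) = \dfrac{\theta}{x^2}\, I_{(\theta, \infty)}(x)$, $\theta > 0$. The conjugate prior for $\theta$ is

\[
g(\theta) = \frac{\beta\, \theta^{\beta - 1}}{m^{\beta}}\, I_{(0, m)}(\theta), \tag{8.12}
\]

and the corresponding posterior distribution of $\theta$ is

\[
g(\theta \mid x) = \frac{(n + \beta)\, \theta^{n + \beta - 1}}{m_1^{\,n + \beta}}\, I_{(0, m_1)}(\theta), \quad m_1 = \min(x_{(1)}, m). \tag{8.13}
\]

Therefore, the predictive density function of $X_{n+1}$, given $x$, is

\[
\begin{aligned}
g(x_{n+1} \mid x) &= \int_0^{\infty} f(x_{n+1} \mid \theta)\, g(\theta \mid x)\, d\theta \\
&= \int_0^{\infty} \frac{\theta}{x_{n+1}^{2}}\, I_{(\theta, \infty)}(x_{n+1})\, \frac{(\beta + n)\, \theta^{\beta + n - 1}}{m_1^{\,\beta + n}}\, I_{(0, m_1)}(\theta)\, d\theta \\
&= \begin{cases} \dfrac{\beta + n}{m_1^{\,\beta + n}\, x_{n+1}^{2}} \displaystyle\int_0^{m_1} \theta^{\beta + n}\, d\theta & \text{if } m_1 < x_{n+1} \\[2ex] \dfrac{\beta + n}{m_1^{\,\beta + n}\, x_{n+1}^{2}} \displaystyle\int_0^{x_{n+1}} \theta^{\beta + n}\, d\theta & \text{if } m_1 > x_{n+1}, \end{cases}
\end{aligned}
\]

so that

\[
g(x_{n+1} \mid x) = \begin{cases} \dfrac{\beta + n}{\beta + n + 1}\, \dfrac{m_1}{x_{n+1}^{2}} & \text{if } m_1 < x_{n+1} \\[2ex] \dfrac{\beta + n}{\beta + n + 1}\, \dfrac{x_{n+1}^{\beta + n - 1}}{m_1^{\,\beta + n}} & \text{if } m_1 > x_{n+1}. \end{cases} \tag{8.14}
\]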


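Example 8.6. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from Pois$(\theta)$ and suppose that the prior
distribution of $\theta$ is Gamma$(\alpha, \beta)$. If $X_{n+1}$ is a future independent observation from the Pois$(\theta)$ distribution
then the predictive pmf of $X_{n+1}$, given $X = x$, is

\[
\begin{aligned}
g(x_{n+1} \mid x) &= \int_0^{\infty} f(x_{n+1} \mid \theta)\, g(\theta \mid x)\, d\theta \\
&= \int_0^{\infty} \frac{e^{-\theta}\, \theta^{x_{n+1}}}{x_{n+1}!}\, \frac{(\beta + n)^{\alpha + \sum x_i}}{\Gamma(\alpha + \sum x_i)}\, \theta^{\alpha + \sum x_i - 1} e^{-(\beta + n)\theta}\, d\theta \\
&= \frac{(\beta + n)^{\alpha + \sum x_i}}{x_{n+1}!\, \Gamma(\alpha + \sum x_i)} \int_0^{\infty} e^{-(\beta + n + 1)\theta}\, \theta^{\alpha + \sum x_i + x_{n+1} - 1}\, d\theta \\
&= \frac{\Gamma(\alpha + \sum x_i + x_{n+1})}{x_{n+1}!\, \Gamma(\alpha + \sum x_i)} \left(\frac{\beta + n}{\beta + n + 1}\right)^{\alpha + \sum x_i} \left(\frac{1}{\beta + n + 1}\right)^{x_{n+1}},
\end{aligned}
\]

which is the negative binomial distribution NBin$\left(\alpha + \sum_{i=1}^{n} x_i,\ \dfrac{\beta + n}{\beta + n + 1}\right)$.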
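
The Poisson-gamma predictive pmf of Example 8.6 can be evaluated on the log scale with `math.lgamma`, which also handles non-integer $\alpha$. In this sketch the function name, data and hyperparameters are illustrative assumptions; the checks use the facts that the pmf sums to one and that its mean equals the posterior mean $(\alpha + \sum x_i)/(\beta + n)$ of $\theta$.

```python
import math


def poisson_gamma_predictive_pmf(y: int, xs, alpha: float, beta: float) -> float:
    """Predictive P(X_{n+1} = y | x) for Poisson data with a Gamma(alpha, beta) prior."""
    shape = alpha + sum(xs)       # posterior shape
    rate = beta + len(xs)         # posterior rate
    log_pmf = (math.lgamma(shape + y) - math.lgamma(shape) - math.lgamma(y + 1)
               + shape * math.log(rate / (rate + 1.0)) - y * math.log(rate + 1.0))
    return math.exp(log_pmf)


# Hypothetical counts and hyperparameters (not from the text).
counts = [2, 0, 3, 1]
alpha0, beta0 = 2.0, 1.0

pmf = [poisson_gamma_predictive_pmf(y, counts, alpha0, beta0) for y in range(200)]
total = sum(pmf)
pred_mean = sum(y * p for y, p in enumerate(pmf))

print(f"sum of pmf = {total:.6f}, predictive mean = {pred_mean:.4f}")
```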


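Example 8.7. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from Bin$(k, \theta)$ and suppose that the prior
distribution of $\theta$ is Beta$(\alpha, \beta)$. If $X_{n+1}$ is a future independent observation, yet to be observed, from the Bin$(k, \theta)$
distribution then the predictive pmf of $X_{n+1}$, given $X = x$, is

\[
\begin{aligned}
g(x_{n+1} \mid x) &= \int_0^1 f(x_{n+1} \mid \theta)\, g(\theta \mid x)\, d\theta \\
&= \int_0^1 \binom{k}{x_{n+1}} \theta^{x_{n+1}} (1 - \theta)^{k - x_{n+1}}\, \frac{\theta^{\sum x_i + \alpha - 1} (1 - \theta)^{\beta + nk - \sum x_i - 1}}{B\left(\sum x_i + \alpha,\ \beta + nk - \sum x_i\right)}\, d\theta \\
&= \binom{k}{x_{n+1}} \frac{B\left(\sum x_i + \alpha + x_{n+1},\ \beta + (n+1)k - \sum x_i - x_{n+1}\right)}{B\left(\sum x_i + \alpha,\ \beta + nk - \sum x_i\right)}, \quad x_{n+1} = 0, 1, \ldots, k.
\end{aligned} \tag{8.15}
\]

It is known as the beta-binomial distribution.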
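
The beta-binomial pmf (8.15) can be computed with log-beta functions and checked against its moments. This is a sketch with hypothetical data and hyperparameters; the comparison values are the standard beta-binomial mean $k a/(a+b)$ and variance $k a b (a + b + k)/[(a+b)^2 (a+b+1)]$, with $a = \alpha + \sum x_i$ and $b = \beta + nk - \sum x_i$.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(y: int, k: int, a: float, b: float) -> float:
    """P(X_{n+1} = y | x) when theta | x ~ Beta(a, b) and X_{n+1} | theta ~ Bin(k, theta)."""
    return math.comb(k, y) * math.exp(log_beta(a + y, b + k - y) - log_beta(a, b))


# Hypothetical data: n = 3 observations from Bin(5, theta), uniform Beta(1, 1) prior.
k_trials, data, alpha0, beta0 = 5, [2, 3, 1], 1.0, 1.0
a_post = alpha0 + sum(data)                          # 7
b_post = beta0 + len(data) * k_trials - sum(data)    # 10

pmf = [beta_binomial_pmf(y, k_trials, a_post, b_post) for y in range(k_trials + 1)]
total = sum(pmf)
mean = sum(y * p for y, p in enumerate(pmf))
var = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))

print(f"sum = {total:.6f}, mean = {mean:.4f}, variance = {var:.4f}")
```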


The mean and variance of the beta-binomial distribution may be obtained by using results on
iterative expectations as follows:

\[
\begin{aligned}
E(X_{n+1} \mid x) &= E(E(X_{n+1} \mid \theta, x) \mid x) \\
&= E(k\theta \mid x) \\
&= \frac{k\left(\alpha + \sum_{i=1}^{n} x_i\right)}{\alpha + \beta + nk},
\end{aligned} \tag{8.16}
\]

and

\[
\begin{aligned}
\mathrm{Var}(X_{n+1} \mid x) &= E(\mathrm{Var}(X_{n+1} \mid \theta, x) \mid x) + \mathrm{Var}(E(X_{n+1} \mid \theta, x) \mid x) \\
&= E(k\theta(1 - \theta) \mid x) + \mathrm{Var}(k\theta \mid x) \\
&= k \left[\alpha + \sum_{i=1}^{n} x_i\right] \left[\beta + nk - \sum_{i=1}^{n} x_i\right] \frac{\alpha + \beta + nk + k}{(\alpha + \beta + nk)^2 (\alpha + \beta + nk + 1)}.
\end{aligned} \tag{8.17}
\]

Remark 8.6. The conjugate prior property for parameters holds for future observables as well. In
particular, if $g(\theta \mid \alpha)$ is the prior pdf of $\theta$ then the requirement that $g(\theta \mid \alpha)$ and $g(\theta \mid x, \alpha)$ belong to the
same family implies that the prior predictive pdf $g(y \mid \alpha)$ and the posterior predictive pdf $g(y \mid x, \alpha)$ will
also belong to the same family. For example, if $X \sim N(\theta, 1)$ and $\theta \sim N(\mu, \tau)$, then we know that
$g(\theta \mid x, \mu, \tau)$ is also normal. Further, the prior predictive density $g(y \mid \mu, \tau)$ is normal and the posterior predictive
density $g(y \mid x, \mu, \tau)$ is also normal (see Property 2 of Section 4.3).


8.3 LAPLACE’S RULE OF SUCCESSION


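Let $X_1, X_2, \ldots, X_n$ be iid Bernoulli random variables with common probability of success $\theta$,
$\theta \in (0, 1)$. Let $s = \sum_{1}^{n} x_i$ be the number of successes in $n$ trials so that $s \sim$ Bin$(n, \theta)$. Suppose the prior
for $\theta$ is Beta$(\alpha, \beta)$. Thus, the posterior distribution of $\theta$ is

\[
g(\theta \mid s, \alpha, \beta, n) = \text{Beta}(\alpha + s, \beta + n - s).
\]

Suppose $y = (y_1, \ldots, y_m)$ is the vector of future $m$ independent Bernoulli random variables each
with probability $\theta$ of success, and let $t = \sum_{1}^{m} y_i$ be the number of successes in the future $m$ independent
Bernoulli trials. Then, the predictive distribution of $t$ is

\[
\begin{aligned}
g(t \mid s) &= \int_0^1 f(t \mid \theta)\, g(\theta \mid s)\, d\theta \\
&= \int_0^1 \binom{m}{t} \theta^{t} (1 - \theta)^{m - t}\, g(\theta \mid s)\, d\theta, \quad \text{since } t \sim \text{Bin}(m, \theta) \\
&= \binom{m}{t} \int_0^1 \frac{\theta^{\alpha_n + t - 1} (1 - \theta)^{\beta_n + m - t - 1}}{B(\alpha_n, \beta_n)}\, d\theta \\
&= \binom{m}{t} \frac{B(\alpha_n + t, \beta_n + m - t)}{B(\alpha_n, \beta_n)},
\end{aligned} \tag{8.18}
\]

where $\alpha_n = \alpha + s$ and $\beta_n = n + \beta - s$.
In particular, for $m = 1$, $t = 1$; $\binom{m}{t} = 1$ and

\[
P(t = 1 \mid s) = \frac{B(\alpha_n + 1, \beta_n)}{B(\alpha_n, \beta_n)} = \frac{\alpha_n}{\alpha_n + \beta_n} = \frac{\alpha + s}{\alpha + \beta + n}.
\]

For the Bayes-Laplace uniform prior for $\theta$, i.e., $\theta \sim U(0, 1)$, we have $\alpha = 1 = \beta$; then

\[
P(t = 1 \mid s) = \frac{s + 1}{n + 2}.
\]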

Result 8.2. If we have n independent Bernoulli trials with the probability of success @ and if @ has 
U(0, 1) distribution then, given r successes in n trials, the probability of success in the (n+1)th trial is 
(r+1)/(n+2). This result is known as Laplace's rule of succession. 
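
The rule of succession and the general predictive distribution (8.18) can be evaluated exactly when the beta parameters are integers. The sketch below uses illustrative function names and restricts itself to integer $\alpha$, $\beta$ so that $B(a, b) = (a-1)!\,(b-1)!/(a+b-1)!$ applies; that restriction is an assumption of the sketch, not of the text.

```python
from fractions import Fraction
from math import comb, factorial


def beta_int(a: int, b: int) -> Fraction:
    """Beta function B(a, b) for positive integers, as an exact fraction."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def predictive_successes(t: int, m: int, s: int, n: int, alpha: int = 1, beta: int = 1) -> Fraction:
    """P(t successes in m future trials | s successes in n trials), as in (8.18)."""
    a_n, b_n = alpha + s, beta + n - s
    return comb(m, t) * beta_int(a_n + t, b_n + m - t) / beta_int(a_n, b_n)


def rule_of_succession(s: int, n: int) -> Fraction:
    """Laplace's rule: probability of success on trial n + 1 under a uniform prior."""
    return Fraction(s + 1, n + 2)


next_head = predictive_successes(t=1, m=1, s=3, n=5)        # three heads in five tosses
future_four = [predictive_successes(t, 4, 3, 5) for t in range(5)]

print(f"P(success on next trial | 3 of 5) = {next_head}")
print(f"distribution of successes in 4 future trials: {[str(p) for p in future_four]}")
print(f"10 successes in 10 trials: {rule_of_succession(10, 10)}")
```

With $m = 1$ and a uniform prior, `predictive_successes` reduces to `rule_of_succession`, which is how the special case is obtained from (8.18).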

This result had stimulated considerable debate about the nature of inductive inference. 

Remark 8.7. Laplace’s rule of succession occupies a supreme position in probability theory. It has
been an often misunderstood and misapplied rule from the time Laplace first gave it in 1774. It
is, like Bayes theorem, one of the most important constructive rules besides the principle of indifference
for converting raw information into numerical values of probabilities, and it provides connections between
probability and frequency.

In the Appendix of Bayes’ “Essay”, Price discusses the famous “problem of the sun rising”. This
solar problem is, in fact, found in various forms in David Hume’s writings in the first half of the eighteenth
century.

It is foolish to use the rule of succession when n is very small. It is so because if we have no 
prior evidence about the event and we make a small number of observations (that is, we have practically 
no evidence) then we cannot expect to get anything useful out of it. The obtained numerical values 
of the probability will be very unstable. 

According to Jaynes (2003), the rule of succession is the solution to a certain problem of 
inference, defined by the prior probability and the data. The case where the problem can be reasonably 
idealized is one when only two hypotheses exist and there is a reason to believe in a constant “causal 
mechanism” and no other prior information can be assumed. It was denounced by the nineteenth 
century workers (like Boole and Venn) because it was not a solution to their problems. 

Venn (1866) and others produced examples where Laplace’s rule of succession and common sense 
were in conflict, and without making an attempt to understand the reason for it, rejected the rule in 
any and all circumstances. For example: 

A boy is 10 years old today. Casual application of the rule of succession gives the probability 
of his living one more year as 11/12. The boy’s grandfather is 70; according to this rule he has the 
probability 71/72 of living one more year. 

Remark 8.8. If all the first n draws give an outcome from the same subpopulation, the probability that the next draw will also give an outcome from that subpopulation is (n+1)/(n+2).

According to Jaynes (2003), some criticize the rule as being biased in favour of the most common subpopulation, since a rare subpopulation will not be detected (Popper, 1983). Jeffreys (1961, § 3.3.3) maintained that, in physics at least, this rule quite often led to rejection of the proposed distribution.

He also noticed that Laplace was quoted out of context and that, in order to demonstrate the absurdity of the rule of succession, some authors applied it to cases where it did not apply, because there was additional information which the rule of succession did not take into account.


Generalisation of Laplace’s rule of succession 


Suppose there are k different hypotheses $H_1, H_2, \ldots, H_k$, the causal mechanism is constant, and there is no other prior information. The random experiment is conducted n times and hypothesis $H_i$ is observed to be true $n_i$ times (i = 1, ..., k), so that $\sum_{i=1}^{k} n_i = n$. The probability that, in the next m repetitions of the experiment, the hypothesis $H_i$ will be true exactly $m_i$ times (i = 1, 2, ..., k), given the earlier outcomes, is

$$P(m_1, m_2, \ldots, m_k \mid n_1, n_2, \ldots, n_k) = \frac{\prod_{i=1}^{k} \binom{n_i + m_i}{m_i}}{\binom{n + m + k - 1}{m}}. \tag{8.19}$$

In particular, for the probability that $H_1$ will be true on the next trial, we have $m = m_1 = 1$ and $m_i = 0$ for $i \neq 1$. Thus,

$$P(m_1 = 1 \mid n_1, n_2, \ldots, n_k) = \frac{\binom{n_1 + 1}{1}}{\binom{n + k}{1}} = \frac{n_1 + 1}{n + k}.$$

Note that for $n = n_1 = 0$ (i.e. no sample information), the probability reduces to 1/k, which is the answer provided by the principle of indifference. In case there are only two hypotheses, the probability is 1/2. A consequence of this is that any conjecture before any verification has probability 1/2 of being true.
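
Formula (8.19) is easy to evaluate directly. The sketch below is illustrative only (the function name and the example counts are assumptions, not from the text); it computes the generalised rule of succession from the observed counts $n_i$ and the hypothesised future counts $m_i$.

```python
from math import comb, prod


def succession_prob(m_counts, n_counts):
    """P(m_1, ..., m_k | n_1, ..., n_k) from eq. (8.19)."""
    k = len(n_counts)
    n, m = sum(n_counts), sum(m_counts)
    numerator = prod(comb(ni + mi, mi) for ni, mi in zip(n_counts, m_counts))
    return numerator / comb(n + m + k - 1, m)


# Three hypotheses observed 2, 3 and 5 times; probability that H_1 holds next.
print(succession_prob([1, 0, 0], [2, 3, 5]))   # (n_1 + 1)/(n + k) = 3/13
# No data at all: the principle of indifference gives 1/k.
print(succession_prob([1, 0, 0], [0, 0, 0]))   # 1/3
```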


8.4 MISCELLANEOUS EXAMPLES 


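Example 8.8. (Leonard & Hsu, 1999) Suppose that $X_1, X_2, \ldots, X_n, X_{n+1}$ is a random sample from $U(0, \theta)$, but $X_{n+1}$ is yet to be observed. If the prior distribution of $\gamma = \log\theta$ is $N(\mu, \sigma^2)$,

(a) find the posterior distribution of $\gamma$, given $x = (x_1, \ldots, x_n)$, and show that this distribution is truncated normal,

(b) find the predictive distribution of $X_{n+1}$ given x.

[Hint: $n\gamma + (\gamma - \mu)^2 / 2\sigma^2 = (\gamma - \mu + n\sigma^2)^2 / 2\sigma^2 + n(\mu - n\sigma^2/2)$.]

Solution. (a) The likelihood function of $\theta$ is

$$\ell(\theta \mid x) = \theta^{-n} I(z < \theta), \quad \text{where } z = \max(x_1, \ldots, x_n).$$

We have $\ell(\gamma \mid x) = \exp(-n\gamma)\, I(\log z \le \gamma)$.

Thus, the posterior distribution of $\gamma$ is

$$\begin{aligned}
g(\gamma \mid x) &\propto \ell(\gamma \mid x) \exp\left[-\frac{1}{2\sigma^2}(\gamma - \mu)^2\right] \\
&\propto \exp\left[-n\gamma - \frac{1}{2\sigma^2}(\gamma - \mu)^2\right] I(\log z \le \gamma) \\
&\propto \exp\left[-\frac{1}{2\sigma^2}(\gamma - \mu + n\sigma^2)^2\right] I(\log z \le \gamma) \\
&= c \exp\left[-\frac{1}{2\sigma^2}(\gamma - \gamma^*)^2\right] I(\log z \le \gamma),
\end{aligned} \tag{8.20}$$

where $c = (2\pi\sigma^2)^{-1/2}\left[1 - \Phi\left(\dfrac{\log z - \gamma^*}{\sigma}\right)\right]^{-1}$ and $\gamma^* = \mu - n\sigma^2$.

(b) Setting $\theta = e^{\gamma}$ in the expression

$$f(x_{n+1} \mid \theta) = \theta^{-1} I[x_{n+1} < \theta]$$

gives $f(x_{n+1} \mid \gamma) = e^{-\gamma} I[\gamma > \log x_{n+1}]$.

Also $I(\gamma > \log x_{n+1})\, I(\gamma > \log z) = I(\gamma > \log z^*)$, where $z^* = \max(x_{n+1}, z)$.

Thus, the predictive pdf of $x_{n+1}$ is

$$g(x_{n+1} \mid x) = \int f(x_{n+1} \mid \gamma)\, g(\gamma \mid x)\, d\gamma.$$

However, $\gamma + \dfrac{(\gamma - \gamma^*)^2}{2\sigma^2} = \dfrac{(\gamma - \tilde{\gamma})^2}{2\sigma^2} + \gamma^* - \dfrac{\sigma^2}{2}$, where $\tilde{\gamma} = \gamma^* - \sigma^2$.

Therefore,

$$\begin{aligned}
g(x_{n+1} \mid x) &= \frac{C_1}{(2\pi\sigma^2)^{1/2}} \int_{\log z^*}^{\infty} \exp\left[-\frac{1}{2\sigma^2}(\gamma - \tilde{\gamma})^2\right] d\gamma \\
&= \begin{cases}
C_1 \left[1 - \Phi\left(\dfrac{\log z - \tilde{\gamma}}{\sigma}\right)\right] & \text{if } 0 < x_{n+1} < z, \\[2ex]
C_1 \left[1 - \Phi\left(\dfrac{\log x_{n+1} - \tilde{\gamma}}{\sigma}\right)\right] & \text{if } z < x_{n+1} < \infty,
\end{cases}
\end{aligned} \tag{8.21}$$

where $C_1 = (2\pi\sigma^2)^{1/2}\, c\, e^{-\gamma^* + \sigma^2/2}$.

Example 8.9. (Sinha, 1998) Let $x = (x_{(1)}, x_{(2)}, \ldots, x_{(n)})$ be the recorded failure times of n components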
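of a system and let $(y_{(1)}, y_{(2)}, \ldots, y_{(m)})$ be the failure times of a future sample of m similar components. Suppose that the failure-time distribution is exponential with pdf

$$f(x \mid \theta) = \frac{1}{\theta} e^{-x/\theta}; \quad x > 0,\ \theta > 0,$$

then the likelihood function of $\theta$ is

$$\ell(\theta \mid x) = \frac{1}{\theta^{n}} \exp\left(-\frac{S}{\theta}\right); \quad \theta > 0 \ \text{ and } \ S = \sum_{i=1}^{n} x_{(i)}.$$

Assuming Jeffreys' non-informative prior distribution for $\theta$, that is, $g(\theta) \propto 1/\theta$, the posterior distribution of $\theta$, given x, is

$$g(\theta \mid x) = \frac{S^{n}}{\Gamma(n)\, \theta^{n+1}}\, e^{-S/\theta}, \quad n > 0,\ S > 0,\ \theta > 0, \tag{8.22}$$

which is Inverted-Gamma(n, S).

If the future failure-time y is independently distributed as exponential with mean $\theta$, then the distribution function of y is

$$F(y \mid \theta) = \frac{1}{\theta} \int_0^{y} e^{-u/\theta}\, du = 1 - e^{-y/\theta}.$$

Therefore, the pdf of the kth order statistic $y_{(k)}$ of the future m independent observations is

$$\begin{aligned}
p(y_{(k)} \mid \theta) &= \frac{m!}{(k-1)!\,(m-k)!} \left[F(y_{(k)})\right]^{k-1} f(y_{(k)}) \left[1 - F(y_{(k)})\right]^{m-k} \\
&= \frac{m!}{(k-1)!\,(m-k)!} \left(1 - e^{-y_{(k)}/\theta}\right)^{k-1} \frac{1}{\theta}\, e^{-y_{(k)}(m-k+1)/\theta} \\
&= \frac{1}{B(k, m-k+1)\, \theta} \sum_{i=0}^{k-1} (-1)^{i} \binom{k-1}{i} e^{-y_{(k)}(m-k+i+1)/\theta}.
\end{aligned} \tag{8.23}$$

Thus, the predictive distribution of $y_{(k)}$ is

$$\begin{aligned}
g(y_{(k)} \mid x) &= \int_0^{\infty} p(y_{(k)} \mid \theta)\, g(\theta \mid x)\, d\theta \\
&= \frac{S^{n}}{\Gamma(n)\, B(k, m-k+1)} \sum_{i=0}^{k-1} (-1)^{i} \binom{k-1}{i} \int_0^{\infty} \frac{1}{\theta^{n+2}} \exp\left[-\frac{1}{\theta}\left(y_{(k)}(m-k+i+1) + S\right)\right] d\theta \\
&= \frac{S^{n}}{\Gamma(n)\, B(k, m-k+1)} \sum_{i=0}^{k-1} (-1)^{i} \binom{k-1}{i} \frac{\Gamma(n+1)}{\left(y_{(k)}(m-k+i+1) + S\right)^{n+1}},
\end{aligned}$$

since the integrand is a kernel of Inverted-Gamma$(n+1,\ y_{(k)}(m-k+i+1) + S)$. So,

$$g(y_{(k)} \mid x) = \frac{n S^{n}}{B(k, m-k+1)} \sum_{i=0}^{k-1} (-1)^{i} \binom{k-1}{i} \left(y_{(k)}(m-k+i+1) + S\right)^{-(n+1)}. \tag{8.24}$$

Remark 8.9. The predictive density of $y_{(k)}$ may be expressed in a standardized form by putting $w = y_{(k)}/S$, so that $dy_{(k)} = S\, dw$. Then

$$g(y_{(k)} \mid x)\, dy_{(k)} = \frac{n}{B(k, m-k+1)} \sum_{i=0}^{k-1} (-1)^{i} \binom{k-1}{i} \left(\frac{y_{(k)}}{S}(m-k+i+1) + 1\right)^{-(n+1)} \frac{dy_{(k)}}{S}$$

and

$$g(w \mid x)\, dw = \frac{n}{B(k, m-k+1)} \sum_{i=0}^{k-1} (-1)^{i} \binom{k-1}{i} \left(w(m-k+i+1) + 1\right)^{-(n+1)} dw,$$

which is a mixture of k densities.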


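Example 8.10. (Sinha, 1998) Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from

$$f(x \mid \mu, \theta) = \frac{1}{\theta} \exp\left(-\frac{x - \mu}{\theta}\right), \quad -\infty < \mu < x < \infty,\ \theta > 0,$$

and suppose $\mu$ and $\theta$ are a-priori independent and their joint prior distribution is Jeffreys' non-informative distribution

$$g(\mu, \theta) = g(\mu)\, g(\theta) \propto 1/\theta.$$

Since

$$\ell(\mu, \theta \mid x) = \frac{1}{\theta^{n}} \exp\left(-\frac{1}{\theta} \sum_{i=1}^{n} (x_i - \mu)\right)$$

and

$$g(\mu, \theta \mid x) \propto \frac{1}{\theta^{n+1}} \exp\left(-\frac{1}{\theta} \sum_{i=1}^{n} (x_i - \mu)\right), \quad -\infty < \mu < x_{(1)} < \infty,\ \theta > 0,$$

with normalizing constant

$$\begin{aligned}
m(x) &= \int_{-\infty}^{x_{(1)}} \int_0^{\infty} \frac{1}{\theta^{n+1}} \exp\left(-\frac{1}{\theta} \sum_{i=1}^{n} (x_i - \mu)\right) d\theta\, d\mu \\
&= \int_{-\infty}^{x_{(1)}} \Gamma(n) \left(\sum_{i=1}^{n} (x_i - \mu)\right)^{-n} d\mu \\
&= \frac{\Gamma(n)}{n(n-1)\, S^{n-1}},
\end{aligned}$$

we have

$$g(\mu, \theta \mid x) = \frac{n(n-1)\, S^{n-1}}{\Gamma(n)\, \theta^{n+1}} \exp\left(-\frac{S + n(x_{(1)} - \mu)}{\theta}\right), \quad \text{where } S = \sum_{i=1}^{n} (x_i - x_{(1)}).$$

The marginal posterior distributions of $\mu$ and $\theta$ are

$$\begin{aligned}
g(\mu \mid x) &= \frac{n(n-1)\, S^{n-1}}{\Gamma(n)} \int_0^{\infty} \frac{1}{\theta^{n+1}} \exp\left(-\frac{S + n(x_{(1)} - \mu)}{\theta}\right) d\theta \\
&= \frac{n(n-1)\, S^{n-1}}{\Gamma(n)} \cdot \frac{\Gamma(n)}{\left(S + n(x_{(1)} - \mu)\right)^{n}} = \frac{n(n-1)\, S^{n-1}}{\left(S + n(x_{(1)} - \mu)\right)^{n}}, \quad -\infty < \mu < x_{(1)},
\end{aligned}$$

and

$$\begin{aligned}
g(\theta \mid x) &= \frac{n(n-1)\, S^{n-1}}{\Gamma(n)\, \theta^{n+1}} \int_{-\infty}^{x_{(1)}} \exp\left(-\frac{S + n(x_{(1)} - \mu)}{\theta}\right) d\mu \\
&= \frac{n(n-1)\, S^{n-1}}{\Gamma(n)\, \theta^{n+1}} \cdot \frac{\theta}{n} \exp\left(-\frac{S}{\theta}\right) \\
&= \frac{S^{n-1}}{\Gamma(n-1)\, \theta^{n}} \exp\left(-\frac{S}{\theta}\right), \quad \theta > 0.
\end{aligned}$$

The predictive pdf of the kth order statistic of the future m independent observations, $y_{(k)}$, may also be obtained. Since

$$F(z) = \int_{\mu}^{z} f(y \mid \theta, \mu)\, dy = \frac{1}{\theta} \int_{\mu}^{z} e^{-(y - \mu)/\theta}\, dy = 1 - \exp\left(-\frac{z - \mu}{\theta}\right), \tag{8.25}$$

the pdf of $y_{(k)}$ is

$$\begin{aligned}
f(y_{(k)} \mid \mu, \theta) &= \frac{m!}{(k-1)!\,1!\,(m-k)!} \left(F(y_{(k)})\right)^{k-1} f(y_{(k)}) \left(1 - F(y_{(k)})\right)^{m-k} \\
&= \frac{1}{B(k, m-k+1)} \left(1 - e^{-(y_{(k)} - \mu)/\theta}\right)^{k-1} \frac{1}{\theta}\, e^{-(y_{(k)} - \mu)/\theta} \left(e^{-(y_{(k)} - \mu)/\theta}\right)^{m-k} \\
&= \frac{1}{\theta\, B(k, m-k+1)} \sum_{i=0}^{k-1} (-1)^{i} \binom{k-1}{i} \exp\left[-\frac{y_{(k)} - \mu}{\theta}(m-k+1+i)\right], \quad y_{(k)} > \mu,\ \theta > 0.
\end{aligned} \tag{8.26}$$

Hence,

$$g(y_{(k)} \mid x) = \int_{-\infty}^{m_0} \int_0^{\infty} f(y_{(k)} \mid \mu, \theta)\, g(\mu, \theta \mid x)\, d\theta\, d\mu, \quad -\infty < \mu < m_0 < \infty,\ \theta > 0,$$

where $m_0 = \min(x_{(1)}, y_{(k)})$.

Case 1. When $x_{(1)} < y_{(k)}$, then $m_0 = x_{(1)}$ and we have

$$\begin{aligned}
g(y_{(k)} \mid x) &= \int_{-\infty}^{x_{(1)}} \int_0^{\infty} f(y_{(k)} \mid \mu, \theta)\, g(\mu, \theta \mid x)\, d\theta\, d\mu \\
&= \frac{n(n-1)\, S^{n-1}}{B(k, m-k+1)\, \Gamma(n)} \sum_{i=0}^{k-1} (-1)^{i} \binom{k-1}{i} \int_{-\infty}^{x_{(1)}} \int_0^{\infty} \frac{1}{\theta^{n+2}} \exp\left[-\frac{S + n(x_{(1)} - \mu)}{\theta}\right] \exp\left[-\frac{y_{(k)} - \mu}{\theta}(m-k+1+i)\right] d\theta\, d\mu \\
&= \frac{n(n-1)\, S^{n-1}\, \Gamma(n+1)}{B(k, m-k+1)\, \Gamma(n)} \sum_{i=0}^{k-1} (-1)^{i} \binom{k-1}{i} \int_{-\infty}^{x_{(1)}} \left[S + n(x_{(1)} - \mu) + (y_{(k)} - \mu)(m-k+1+i)\right]^{-(n+1)} d\mu \\
&= \frac{n^{2}(n-1)\, S^{n-1}}{B(k, m-k+1)} \sum_{i=0}^{k-1} (-1)^{i} \binom{k-1}{i} \int_{-\infty}^{x_{(1)}} \left[S + n x_{(1)} + (m-k+i+1)\, y_{(k)} - \mu(n+m-k+i+1)\right]^{-(n+1)} d\mu \\
&= \frac{n(n-1)\, S^{n-1}}{B(k, m-k+1)} \sum_{i=0}^{k-1} (-1)^{i} \binom{k-1}{i} \frac{\left[S + n x_{(1)} + y_{(k)}(m-k+i+1) - (n+m-k+i+1)\, x_{(1)}\right]^{-n}}{n+m-k+i+1} \\
&= \frac{n(n-1)\, S^{n-1}}{B(k, m-k+1)} \sum_{i=0}^{k-1} (-1)^{i} \binom{k-1}{i} \frac{\left[S + (m-k+i+1)(y_{(k)} - x_{(1)})\right]^{-n}}{n+m-k+i+1}.
\end{aligned} \tag{8.27}$$

Case 2. When $y_{(k)} < x_{(1)}$, then $m_0 = y_{(k)}$ and we have

$$\begin{aligned}
g(y_{(k)} \mid x) &= \frac{n(n-1)\, S^{n-1}}{B(k, m-k+1)\, \Gamma(n)} \sum_{i=0}^{k-1} (-1)^{i} \binom{k-1}{i} \int_{-\infty}^{y_{(k)}} \int_0^{\infty} \frac{1}{\theta^{n+2}} \exp\left[-\frac{S + n(x_{(1)} - \mu)}{\theta}\right] \exp\left[-\frac{y_{(k)} - \mu}{\theta}(m-k+1+i)\right] d\theta\, d\mu \\
&= \frac{n(n-1)\, S^{n-1}\, \Gamma(n+1)}{B(k, m-k+1)\, \Gamma(n)} \sum_{i=0}^{k-1} (-1)^{i} \binom{k-1}{i} \int_{-\infty}^{y_{(k)}} \left[S + n(x_{(1)} - \mu) + (y_{(k)} - \mu)(m-k+1+i)\right]^{-(n+1)} d\mu \\
&= \frac{n^{2}(n-1)\, S^{n-1}}{B(k, m-k+1)} \sum_{i=0}^{k-1} (-1)^{i} \binom{k-1}{i} \int_{-\infty}^{y_{(k)}} \left[S + n x_{(1)} + y_{(k)}(m-k+i+1) - \mu(m-k+1+i+n)\right]^{-(n+1)} d\mu \\
&= \frac{n(n-1)\, S^{n-1}}{B(k, m-k+1)} \sum_{i=0}^{k-1} (-1)^{i} \binom{k-1}{i} \frac{\left[S + n x_{(1)} + y_{(k)}(m-k+1+i) - y_{(k)}(m-k+1+i+n)\right]^{-n}}{m-k+i+n+1} \\
&= \frac{n(n-1)\, S^{n-1}}{B(k, m-k+1) \left[S + n(x_{(1)} - y_{(k)})\right]^{n}} \sum_{i=0}^{k-1} (-1)^{i} \binom{k-1}{i} \frac{1}{m-k+i+n+1}.
\end{aligned} \tag{8.28}$$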

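Example 8.11. (Geisser, 1993) Suppose $X = (X_1, X_2, \ldots, X_N) = (x, x^{(c)})$, where $x = (x_1, x_2, \ldots, x_n)$ represents a sample fully observed from an exponential survival-time density

$$f(x \mid \theta) = \theta e^{-\theta x}, \quad \theta, x > 0,$$

and $x^{(c)}$ represents a sample censored at $x_{n+1}, \ldots, x_N$, respectively. Hence

$$\begin{aligned}
\ell(\theta \mid x) &\propto \left[\prod_{i=1}^{n} f(x_i \mid \theta)\right] \left[\prod_{i=n+1}^{N} \left(1 - F(x_i \mid \theta)\right)\right] \\
&\propto \theta^{n} \exp\left(-\theta \sum_{i=1}^{N} x_i\right), \quad \text{since } F(x \mid \theta) = 1 - e^{-\theta x}.
\end{aligned}$$

Further assume that $g(\theta) = \beta^{\alpha} \theta^{\alpha - 1} e^{-\beta\theta} / \Gamma(\alpha)$; then

$$g(\theta \mid x) = \beta_1^{\alpha_1} \theta^{\alpha_1 - 1} e^{-\beta_1 \theta} / \Gamma(\alpha_1), \tag{8.29}$$

where $\alpha_1 = \alpha + n$ and $\beta_1 = \beta + \sum_{i=1}^{N} x_i$.

Suppose we are interested in the number R, out of M future observations $Y_1, \ldots, Y_M$, that will survive time t. Then

$$P(Y > t \mid \theta) = e^{-\theta t}$$

and

$$\begin{aligned}
P(R = r \mid x) &= \int f(r \mid \theta)\, g(\theta \mid x)\, d\theta \\
&= \int_0^{\infty} \binom{M}{r} e^{-\theta t r} \left(1 - e^{-\theta t}\right)^{M-r} \beta_1^{\alpha_1} \theta^{\alpha_1 - 1} e^{-\beta_1 \theta} / \Gamma(\alpha_1)\, d\theta \\
&= \binom{M}{r} \frac{\beta_1^{\alpha_1}}{\Gamma(\alpha_1)} \sum_{j=0}^{M-r} (-1)^{j} \binom{M-r}{j} \int_0^{\infty} \theta^{\alpha_1 - 1} e^{-\theta(t(r+j) + \beta_1)}\, d\theta \\
&= \binom{M}{r} \sum_{j=0}^{M-r} (-1)^{j} \binom{M-r}{j} \frac{\beta_1^{\alpha_1}}{\left(t(r+j) + \beta_1\right)^{\alpha_1}}, \quad r = 0, 1, \ldots, M.
\end{aligned}$$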


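Example 8.12. Suppose $X \sim \text{Multinomial}(n, \theta)$, where $0 < \theta_i < 1$, $\sum_{i=1}^{k} \theta_i = 1$, $X = (X_1, \ldots, X_{k-1})$ and $\sum_{i=1}^{k} x_i = n$. Suppose the prior for $\theta$ is Dirichlet$(\mu, \mu_k)$, where $\mu = (\mu_1, \ldots, \mu_{k-1})$. The posterior distribution of $\theta$, given x, is Dirichlet$\left(\mu + x,\ \mu_k + n - \sum_{i=1}^{k-1} x_i\right)$. Suppose Y is a future independent random vector having a multinomial distribution with parameters m and $\theta$, $0 < \theta_i < 1$, $\sum_{i=1}^{k-1} \theta_i < 1$. The predictive distribution of Y, given x, is

$$\begin{aligned}
g(y \mid x) &= \int_{\Theta} f(y \mid m, \theta)\, g(\theta \mid x)\, d\theta \\
&= \frac{\binom{m}{y}}{D\left(x + \mu,\ n - \sum_{i=1}^{k-1} x_i + \mu_k\right)} \int_{\Theta} \prod_{i=1}^{k-1} \theta_i^{x_i + \mu_i + y_i - 1} \left(1 - \sum_{i=1}^{k-1} \theta_i\right)^{m + n - \sum_{i=1}^{k-1} x_i - \sum_{i=1}^{k-1} y_i + \mu_k - 1} d\theta \\
&= \frac{\binom{m}{y}}{D\left(x + \mu,\ n - \sum_{i=1}^{k-1} x_i + \mu_k\right)} \cdot \frac{\prod_{i=1}^{k-1} \Gamma(x_i + y_i + \mu_i)\ \Gamma\left(m + n - \sum_{i=1}^{k-1} x_i - \sum_{i=1}^{k-1} y_i + \mu_k\right)}{\Gamma\left(m + n + \sum_{i=1}^{k} \mu_i\right)} \\
&= \binom{m}{y} \frac{D\left(x + y + \mu,\ m + n - \sum_{i=1}^{k-1} x_i - \sum_{i=1}^{k-1} y_i + \mu_k\right)}{D\left(x + \mu,\ n - \sum_{i=1}^{k-1} x_i + \mu_k\right)},
\end{aligned}$$

$$y_i = 0, 1, \ldots, m;\quad i = 1, 2, \ldots, k-1;\quad \sum_{i=1}^{k-1} y_i \le m,$$

where $\Theta = \left\{\theta : 0 < \theta_1, \theta_2, \ldots, \theta_{k-1} < 1;\ \sum_{i=1}^{k-1} \theta_i < 1\right\}$, $\binom{m}{y}$ is the multinomial coefficient $m! / \left(y_1! \cdots y_{k-1}!\, (m - \sum_{i=1}^{k-1} y_i)!\right)$, and $D(a_1, \ldots, a_k) = \prod_{i=1}^{k} \Gamma(a_i) / \Gamma\left(\sum_{i=1}^{k} a_i\right)$ is the Dirichlet normalizing function. This is known as the Dirichlet-Multinomial distribution.

Remark 8.10. For k = 2, Dirichlet$(\theta_1, \theta_2)$ reduces to the Beta$(\theta_1, \theta_2)$ distribution and Multinomial$(n, \theta)$ reduces to Bin$(n, \theta)$. Further, the Dirichlet-Multinomial DiMu$(n, \theta_1, \theta_2)$ reduces to Beta-Binomial$(n, \theta_1, \theta_2)$.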
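
The Dirichlet-Multinomial predictive probability is a ratio of Dirichlet normalizing functions, so it can be evaluated on the log scale. The sketch below is illustrative: it passes all k cell counts explicitly (including the kth cell, which the text writes as a remainder), and the function names are assumptions.

```python
from math import exp, lgamma


def log_dirichlet_norm(params):
    """log D(a_1, ..., a_k) = sum(log Gamma(a_i)) - log Gamma(sum(a_i))."""
    return sum(lgamma(p) for p in params) - lgamma(sum(params))


def dirichlet_multinomial_pmf(y, x, mu):
    """Predictive P(Y = y | x) for full length-k count vectors y, x and prior mu."""
    m = sum(y)
    # log of the multinomial coefficient m! / (y_1! ... y_k!)
    log_coef = lgamma(m + 1) - sum(lgamma(yi + 1) for yi in y)
    posterior = [xi + mi for xi, mi in zip(x, mu)]
    updated = [pi + yi for pi, yi in zip(posterior, y)]
    return exp(log_coef + log_dirichlet_norm(updated) - log_dirichlet_norm(posterior))


# k = 2 reduces to the beta-binomial: 3 successes in 6 trials, uniform prior.
print(dirichlet_multinomial_pmf((2, 4), (3, 3), (1, 1)))
```

For k = 2 the call above reproduces the beta-binomial probability $\binom{6}{2} B(6, 8)/B(4, 4)$, in line with Remark 8.10.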


8.5 PREDICTION FOR EXPONENTIAL FAMILY OF DISTRIBUTIONS 


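Result 8.3. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from the one-parameter exponential family $f(x \mid \theta) = u(x)\, v(\theta) \exp(\omega(\theta) h(x))$ having the conjugate family for $\theta$

$$g(\theta \mid \tau_0, \tau_1) = \left(k(\tau_0, \tau_1)\right)^{-1} v(\theta)^{\tau_0} \exp(\omega(\theta) \tau_1), \quad \theta \in \Theta,$$

where $\tau_0, \tau_1$ are such that $k(\tau_0, \tau_1) = \int_{\Theta} \left(v(\theta)\right)^{\tau_0} \exp(\omega(\theta) \tau_1)\, d\theta < \infty$.

Then, the predictive density function for future m independent observations $Y = (Y_1, \ldots, Y_m)$ is

$$g(y \mid x, \tau_0, \tau_1) = \prod_{j=1}^{m} u(y_j)\ \frac{k\left(\tau_0 + n + m,\ \tau_1 + \sum_{i=1}^{n} h(x_i) + \sum_{j=1}^{m} h(y_j)\right)}{k\left(\tau_0 + n,\ \tau_1 + \sum_{i=1}^{n} h(x_i)\right)}. \tag{8.30}$$

Proof: Since the posterior distribution of $\theta$, given x, is

$$g(\theta \mid x) = g\left(\theta \,\Big|\, \tau_0 + n,\ \tau_1 + \sum_{i=1}^{n} h(x_i)\right),$$

we have

$$\begin{aligned}
g(y \mid x, \tau_0, \tau_1) &= \int_{\Theta} f(y \mid \theta)\, g(\theta \mid x)\, d\theta \\
&= \prod_{j=1}^{m} u(y_j) \left[k\left(\tau_0 + n,\ \tau_1 + \sum_{i=1}^{n} h(x_i)\right)\right]^{-1} \int_{\Theta} \left(v(\theta)\right)^{\tau_0 + n + m} \exp\left\{\omega(\theta) \left(\tau_1 + \sum_{i=1}^{n} h(x_i) + \sum_{j=1}^{m} h(y_j)\right)\right\} d\theta \\
&= \prod_{j=1}^{m} u(y_j)\ \frac{k\left(\tau_0 + n + m,\ \tau_1 + \sum_{i=1}^{n} h(x_i) + \sum_{j=1}^{m} h(y_j)\right)}{k\left(\tau_0 + n,\ \tau_1 + \sum_{i=1}^{n} h(x_i)\right)}.
\end{aligned}$$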


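Example 8.13. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from Pois$(\theta)$; then

$$\ell(\theta \mid x) = \prod_{i=1}^{n} (x_i!)^{-1} \exp\left(-n\theta + \sum_{i=1}^{n} x_i \log\theta\right),$$

having the conjugate prior density for $\theta$

$$g(\theta \mid \tau_0, \tau_1) = \left(k(\tau_0, \tau_1)\right)^{-1} \theta^{\tau_1} \exp(-\tau_0 \theta).$$

Hence, the predictive density of future observations $Y = (Y_1, \ldots, Y_m)$ is

$$g(y \mid x, \tau_0, \tau_1) = \prod_{j=1}^{m} (y_j!)^{-1}\ \frac{k\left(\tau_0 + n + m,\ \tau_1 + \sum_{i=1}^{n} x_i + \sum_{j=1}^{m} y_j\right)}{k\left(\tau_0 + n,\ \tau_1 + \sum_{i=1}^{n} x_i\right)}.$$

Since

$$k(\tau_0, \tau_1) = \int_0^{\infty} \theta^{\tau_1} \exp(-\tau_0 \theta)\, d\theta = \frac{\Gamma(\tau_1 + 1)}{\tau_0^{\tau_1 + 1}},$$

therefore,

$$g(y \mid x, \tau_0, \tau_1) = \prod_{j=1}^{m} (y_j!)^{-1}\ \frac{\Gamma\left(\tau_1 + \sum_{i=1}^{n} x_i + \sum_{j=1}^{m} y_j + 1\right) (\tau_0 + n)^{\tau_1 + \sum_{i=1}^{n} x_i + 1}}{\Gamma\left(\tau_1 + \sum_{i=1}^{n} x_i + 1\right) (\tau_0 + n + m)^{\tau_1 + \sum_{i=1}^{n} x_i + \sum_{j=1}^{m} y_j + 1}}.$$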
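
Result 8.3 applied to the Poisson case can be coded directly from the normalizing function $k(\tau_0, \tau_1) = \Gamma(\tau_1 + 1)/\tau_0^{\tau_1 + 1}$. The following sketch is illustrative; the function names and the example data are assumptions, not part of the text.

```python
from math import exp, lgamma, log


def log_k(tau0, tau1):
    """log k(tau0, tau1) = log Gamma(tau1 + 1) - (tau1 + 1) log tau0."""
    return lgamma(tau1 + 1) - (tau1 + 1) * log(tau0)


def poisson_predictive(y, x, tau0, tau1):
    """Joint predictive density of future counts y given observed counts x (eq. 8.30)."""
    n, m = len(x), len(y)
    log_u = -sum(lgamma(yj + 1) for yj in y)      # product of u(y_j) = 1 / y_j!
    t1 = tau1 + sum(x)
    return exp(log_u + log_k(tau0 + n + m, t1 + sum(y)) - log_k(tau0 + n, t1))


# Hypothetical data: four observed counts, prior hyperparameters tau0 = 1, tau1 = 2.
x = [2, 0, 3, 1]
print([round(poisson_predictive([y], x, 1.0, 2.0), 4) for y in range(5)])
```

For a single future observation (m = 1) the values form a negative binomial pmf and therefore sum to one over y = 0, 1, 2, ....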


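Result 8.4. Let a statistic t have a sampling density

$$f(t \mid \theta) = \exp\left(A(\theta) + t B(\theta) + C(t)\right), \tag{8.31}$$

where $\theta$ possesses prior distribution g. Then

$$\text{(i)}\quad E(B(\theta) \mid t) = \frac{\partial}{\partial t} \log g(t) - \frac{\partial}{\partial t} C(t), \tag{8.32}$$

and

$$\text{(ii)}\quad \text{Var}(B(\theta) \mid t) = \frac{\partial^2}{\partial t^2} \log g(t) - \frac{\partial^2}{\partial t^2} C(t), \tag{8.33}$$

where g(t) denotes the prior predictive density of t.

Proof: Since $g(t) = \int_{\Theta} f(t \mid \theta)\, g(\theta)\, d\theta$, therefore

$$\begin{aligned}
\frac{\partial}{\partial t} \log g(t) &= \frac{1}{g(t)} \frac{\partial}{\partial t} \int_{\Theta} f(t \mid \theta)\, g(\theta)\, d\theta \\
&= \frac{1}{g(t)} \left[C'(t)\, g(t) + \int_{\Theta} B(\theta)\, g(\theta) \exp\left(A(\theta) + t B(\theta) + C(t)\right) d\theta\right] \\
&= C'(t) + \frac{1}{g(t)} \int_{\Theta} B(\theta) f(t \mid \theta)\, g(\theta)\, d\theta \\
&= C'(t) + E(B(\theta) \mid t).
\end{aligned}$$

Hence, $E(B(\theta) \mid t) = \dfrac{\partial}{\partial t} \log g(t) - C'(t)$.

In order to get (ii), consider

$$\begin{aligned}
\frac{\partial^2}{\partial t^2} \log g(t) &= \frac{\partial}{\partial t} \left[C'(t) + \frac{1}{g(t)} \int_{\Theta} B(\theta) f(t \mid \theta)\, g(\theta)\, d\theta\right] \\
&= C''(t) - \frac{g'(t)}{g(t)^2} \int_{\Theta} B(\theta) f(t \mid \theta)\, g(\theta)\, d\theta + \frac{1}{g(t)} \int_{\Theta} B(\theta) \left(B(\theta) + C'(t)\right) f(t \mid \theta)\, g(\theta)\, d\theta \\
&= C''(t) + \left[-\frac{g'(t)}{g(t)} E(B(\theta) \mid t) + C'(t)\, E(B(\theta) \mid t) + E(B^2(\theta) \mid t)\right] \\
&= C''(t) + \left[-\left(E(B(\theta) \mid t)\right)^2 - C'(t)\, E(B(\theta) \mid t) + C'(t)\, E(B(\theta) \mid t) + E(B^2(\theta) \mid t)\right] \\
&= C''(t) + \text{Var}(B(\theta) \mid t),
\end{aligned}$$

where the fourth line uses $g'(t)/g(t) = C'(t) + E(B(\theta) \mid t)$ from part (i). Hence, the result (ii).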


Corollary 1: Suppose the sampling probability mass function of a statistic t is a modified power series distribution having pmf

$$f(t \mid \theta) = a(t)\, \frac{\left(u(\theta)\right)^{t}}{v(\theta)}$$

and the prior distribution of $\theta$ is $g(\theta)$. Then

$$E(\log u(\theta) \mid t) = \frac{\partial}{\partial t} \log \frac{g(t)}{a(t)}$$

and

$$\text{Var}(\log u(\theta) \mid t) = \frac{\partial^2}{\partial t^2} \log \frac{g(t)}{a(t)}.$$

Proof. The result follows from Result 8.4 with

$$C(t) = \log a(t), \quad B(\theta) = \log u(\theta) \quad \text{and} \quad A(\theta) = -\log v(\theta).$$


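Example 8.14. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from $N(\theta, 1)$ and let $g(\theta)$ be an arbitrary prior density for the unknown mean $\theta$. Show that

$$E(\theta \mid \bar{x}) = \bar{x} + \frac{1}{n} \frac{\partial}{\partial \bar{x}} \log g(\bar{x})$$

and

$$\text{Var}(\theta \mid \bar{x}) = \frac{1}{n} + \frac{1}{n^2} \frac{\partial^2}{\partial \bar{x}^2} \log g(\bar{x}).$$

Proof.

$$\begin{aligned}
f(\bar{x} \mid \theta) &= \sqrt{\frac{n}{2\pi}} \exp\left(-\frac{n}{2}(\bar{x} - \theta)^2\right) \\
&= \exp\left(-\frac{n\theta^2}{2} + n\theta\bar{x} - \frac{n\bar{x}^2}{2} + \frac{1}{2} \log\frac{n}{2\pi}\right).
\end{aligned}$$

Here $A(\theta) = -\dfrac{n\theta^2}{2}$, $B(\theta) = n\theta$, $t = \bar{x}$, $C(\bar{x}) = -\dfrac{n\bar{x}^2}{2} + \dfrac{1}{2} \log\dfrac{n}{2\pi}$.

Hence, $E(n\theta \mid \bar{x}) = \dfrac{\partial}{\partial \bar{x}} \log g(\bar{x}) + n\bar{x}$

and $\text{Var}(n\theta \mid \bar{x}) = \dfrac{\partial^2}{\partial \bar{x}^2} \log g(\bar{x}) + n$.

Dividing the first expression by n and the second by $n^2$ gives the stated results.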
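
The two identities of Example 8.14 can be checked in a case where the answer is known in closed form. The sketch below assumes, purely for illustration, a $N(\mu_0, \tau^2)$ prior, for which the prior predictive density of $\bar{x}$ is $N(\mu_0, \tau^2 + 1/n)$; the derivatives of $\log g(\bar{x})$ are approximated by central differences. All names and numbers are hypothetical.

```python
from math import log, pi


def log_marginal(xbar, n, mu0, tau2):
    """log g(xbar): prior predictive N(mu0, tau2 + 1/n) under a N(mu0, tau2) prior."""
    var = tau2 + 1.0 / n
    return -0.5 * log(2 * pi * var) - (xbar - mu0) ** 2 / (2 * var)


def tweedie_moments(xbar, n, mu0, tau2, h=1e-3):
    """Posterior mean and variance of theta via the formulas of Example 8.14."""
    f = lambda z: log_marginal(z, n, mu0, tau2)
    d1 = (f(xbar + h) - f(xbar - h)) / (2 * h)               # first derivative
    d2 = (f(xbar + h) - 2 * f(xbar) + f(xbar - h)) / h ** 2  # second derivative
    return xbar + d1 / n, 1.0 / n + d2 / n ** 2


n, mu0, tau2, xbar = 10, 0.5, 2.0, 1.3
print(tweedie_moments(xbar, n, mu0, tau2))
# Conjugate normal-normal answer for comparison:
print((n * tau2 * xbar + mu0) / (n * tau2 + 1), tau2 / (n * tau2 + 1))
```

Both lines print the same pair of numbers (up to finite-difference error), since for a normal prior the formulas must reproduce the usual conjugate posterior mean and variance.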


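Example 8.15. Let X follow the generalised negative binomial distribution with pmf

$$f(x \mid \theta) = \frac{n\, \Gamma(n + \beta x)}{x!\, \Gamma(n + \beta x - x + 1)}\, \theta^{x} (1 - \theta)^{n + \beta x - x} \tag{8.34}$$

$$= \exp\left(x \log\left(\theta (1 - \theta)^{\beta - 1}\right) + n \log(1 - \theta) + C(x)\right),$$

for $x = 0, 1, \ldots$; $0 < \theta < 1$; $|\theta\beta| < 1$; $\beta = 0$ or $\beta \ge 1$; $n > 0$.

Here $A(\theta) = n \log(1 - \theta)$, $t = x$, $B(\theta) = \log\left(\theta (1 - \theta)^{\beta - 1}\right)$,

and $C(x) = \log n + \log\Gamma(n + \beta x) - \log\Gamma(x + 1) - \log\Gamma(n + \beta x - x + 1)$.

Hence,

$$\begin{aligned}
E\left(\log \theta (1 - \theta)^{\beta - 1} \mid x\right) &= \frac{\partial}{\partial x} \log g(x) - \frac{\partial}{\partial x} C(x) \\
&= \frac{\partial}{\partial x} \log g(x) - \beta \Psi(n + \beta x) + \Psi(x + 1) + (\beta - 1) \Psi(n + \beta x - x + 1),
\end{aligned}$$

where $\Psi(\cdot)$ is the digamma function.

Also,

$$\text{Var}\left(\log \theta (1 - \theta)^{\beta - 1} \mid x\right) = \frac{\partial^2}{\partial x^2} \log g(x) - \beta^2 \Psi'(n + \beta x) + \Psi'(x + 1) + (\beta - 1)^2 \Psi'(n + \beta x - x + 1),$$

where $\Psi'(\cdot)$ is the trigamma function.

In particular, for $\beta = 1$, $f(x \mid \theta)$ is the negative binomial distribution and we have

$$E(\log\theta \mid x) = \frac{\partial}{\partial x} \log g(x) - \Psi(n + x) + \Psi(x + 1), \tag{8.35}$$

and

$$\text{Var}(\log\theta \mid x) = \frac{\partial^2}{\partial x^2} \log g(x) - \Psi'(n + x) + \Psi'(x + 1). \tag{8.36}$$

If $g(\theta)$ is a Jeffreys prior, i.e., $g(\theta) \propto \theta^{-1/2} (1 - \theta)^{-1}$, then

$$g(x) = \int_0^1 f(x \mid \theta)\, g(\theta)\, d\theta = c_x \int_0^1 \theta^{x - 1/2} (1 - \theta)^{n - 1}\, d\theta,$$

where $c_x = \dfrac{n\, \Gamma(n + x)}{x!\, \Gamma(n + 1)} = \dbinom{n + x - 1}{x}$.

Hence $g(x) = \dbinom{n + x - 1}{x} B\left(x + \tfrac{1}{2},\ n\right)$, so that

$$E(\log\theta \mid x) = \frac{\partial}{\partial x} \left[\log\Gamma\left(x + \tfrac{1}{2}\right) - \log\Gamma\left(x + n + \tfrac{1}{2}\right)\right] = \Psi\left(x + \tfrac{1}{2}\right) - \Psi\left(x + n + \tfrac{1}{2}\right), \tag{8.37}$$

and

$$\text{Var}(\log\theta \mid x) = \frac{\partial^2}{\partial x^2} \left[\log\Gamma\left(x + \tfrac{1}{2}\right) - \log\Gamma\left(x + n + \tfrac{1}{2}\right)\right] = \Psi'\left(x + \tfrac{1}{2}\right) - \Psi'\left(x + n + \tfrac{1}{2}\right). \tag{8.38}$$

Remark 8.11. It is interesting to observe that the function $B(\theta)$, appearing in the exponential family of distributions (8.31), is the canonical link in the theory of generalised linear models. In the case of the binomial distribution, the canonical link function is the logit link function (see Gill (2002), page 40).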


8.6 PREDICTIVE DISTRIBUTION AND RELIABILITY ESTIMATION 


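Let the random variable X represent the life of an item or a component and let $f(x \mid \theta)$ be its pdf. The reliability function at any time t is defined by

$$R_t(\theta) = P[X > t \mid \theta] = \int_t^{\infty} f(x \mid \theta)\, dx.$$

Under SELF, the Bayes estimate of $R_t$ is given by

$$\begin{aligned}
\hat{R}_t &= \int_{\Theta} R_t(\theta)\, g(\theta \mid x)\, d\theta \\
&= \int_{\Theta} \left[\int_t^{\infty} f(y \mid \theta)\, dy\right] g(\theta \mid x)\, d\theta.
\end{aligned}$$

On changing the order of integration,

$$\hat{R}_t = \int_t^{\infty} \int_{\Theta} f(y \mid \theta)\, g(\theta \mid x)\, d\theta\, dy,$$

that is,

$$\hat{R}_t = \int_t^{\infty} g(y \mid x)\, dy = P(Y \ge t \mid x). \tag{8.39}$$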
t 


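Example 8.16. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from the Rayleigh pdf

$$f(x \mid \theta) = \frac{2x}{\theta} \exp\left(-\frac{x^2}{\theta}\right), \quad x, \theta > 0.$$

Assume that the prior for $\theta$ is Hartigan's ALI prior $g(\theta) \propto 1/\theta^{2}$.

The posterior distribution of $\theta$, given x, is

$$g(\theta \mid x) \propto \frac{1}{\theta^{n+2}} \exp\left(-\frac{S}{\theta}\right), \quad S = \sum_{j=1}^{n} x_j^2,$$

and the predictive distribution of a future independent observation Y is

$$g(y \mid x) \propto \int_0^{\infty} \frac{2y}{\theta^{n+3}} \exp\left(-\frac{y^2 + S}{\theta}\right) d\theta,$$

that is,

$$g(y \mid x) = \frac{2y(n+1)\, S^{n+1}}{(y^2 + S)^{n+2}}, \quad y > 0,$$

and the Bayes estimate of the reliability function is

$$\hat{R}(t) = \int_t^{\infty} g(y \mid x)\, dy = \left(\frac{S}{S + t^2}\right)^{n+1}, \quad t > 0.$$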
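
Equation (8.39) says that the Bayes reliability estimate is the upper tail of the predictive density. For the Rayleigh example this can be confirmed by numerical integration; the sketch below uses hypothetical data and illustrative function names, with $S = \sum x_j^2$ as above.

```python
def rayleigh_predictive_pdf(y, x):
    """Predictive density g(y | x) = 2 y (n+1) S^(n+1) / (y^2 + S)^(n+2)."""
    n, S = len(x), sum(xi * xi for xi in x)
    return 2 * y * (n + 1) * S ** (n + 1) / (y * y + S) ** (n + 2)


def rayleigh_reliability(t, x):
    """Bayes estimate of reliability: (S / (S + t^2))^(n+1)."""
    n, S = len(x), sum(xi * xi for xi in x)
    return (S / (S + t * t)) ** (n + 1)


# Hypothetical failure-time data; midpoint-rule integral of the predictive tail.
x = [1.2, 0.8, 2.1, 1.5]
t0, h = 1.0, 0.001
tail = sum(rayleigh_predictive_pdf(t0 + (i + 0.5) * h, x) for i in range(59000)) * h
print(tail, rayleigh_reliability(t0, x))
```

The numerically integrated tail and the closed form agree to several decimal places, as (8.39) requires.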
8.7 PREDICTIVE INTERVAL


We can extend the idea of constructing shortest credible interval for the unknown parameter to 
the case of constructing “highest predictive density intervals” for the future observation. The highest 
predictive pdf interval is, therefore, an interval with the given probability content such that predictive 
pdf’s values over the region are not less than those relating to any other interval with the same 
probability content. 

Example 8.17. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from the $N(\theta, 1)$ distribution. Let us assume that the prior distribution of $\theta$ is $g(\theta) \propto 1$. The predictive pdf for a future observation $X_{n+1}$, independent of the observed sample and having the $N(\theta, 1)$ distribution, is known to be $N\left(\bar{x}, \dfrac{n+1}{n}\right)$. Since the predictive density function is unimodal as well as symmetric about its mean, the 95% highest predictive density interval for $X_{n+1}$ is

$$\left[\bar{x} - 1.96\sqrt{\frac{n+1}{n}},\ \ \bar{x} + 1.96\sqrt{\frac{n+1}{n}}\right].$$


Example 8.18. Suppose X has a Bin$(6, \theta)$ distribution and the prior for $\theta$ is U(0, 1). If X = 3 is observed, then the predictive distribution of a future independent observation Y having conditional distribution Bin$(6, \theta)$ is beta-binomial

$$g(y \mid X = 3) = \binom{6}{y} \frac{B(y + 4,\ 10 - y)}{B(4, 4)}, \quad y = 0, 1, \ldots, 6.$$

Since it has a unique mode at $\left\lfloor \dfrac{(n+1)x}{n} \right\rfloor = \lfloor 3.5 \rfloor = 3$, the shortest predictive interval will be around 3.

In order to work out the shortest 90% predictive interval, we have to find the values of a and b such that

$$P[y \in (a, b) \mid X = 3] \ge 0.9.$$

The following table gives $P(Y = k \mid X = 3)$, k = 0, 1, ..., 6.

k               0       1       2       3       4       5       6
P(Y=k|X=3)   0.0490  0.1305  0.2040  0.2331  0.2040  0.1305  0.0490

The shortest predictive interval is [1, 5], since

$$P(y \in [1, 5] \mid X = 3) = 0.9021 \approx 0.90.$$
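
The table and the shortest interval can be reproduced by brute force. The sketch below is illustrative (function names are assumptions): it evaluates the beta-binomial pmf with posterior Beta(4, 4) and scans all contiguous ranges for the shortest one with the required coverage.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(y, m, a, b):
    """P(Y = y) for m future trials when theta ~ Beta(a, b)."""
    return comb(m, y) * exp(log_beta(a + y, b + m - y) - log_beta(a, b))


def shortest_interval(pmf, level):
    """Shortest contiguous range [lo, hi] whose total probability is >= level."""
    best = None
    for lo in range(len(pmf)):
        total = 0.0
        for hi in range(lo, len(pmf)):
            total += pmf[hi]
            if total >= level:
                if best is None or hi - lo < best[1] - best[0]:
                    best = (lo, hi, total)
                break
    return best


# X = 3 successes in 6 trials with a U(0, 1) prior gives a Beta(4, 4) posterior.
pmf = [beta_binomial_pmf(y, 6, 4, 4) for y in range(7)]
print([round(p, 4) for p in pmf])
lo, hi, cover = shortest_interval(pmf, 0.90)
print(lo, hi, round(cover, 4))
```

Because the predictive pmf is unimodal, the shortest contiguous range found this way coincides with the highest predictive density region.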


8.8 DECISION THEORETIC APPROACH TO PREDICTION 
The decision theoretic approach to prediction is similar to the one concerning parameters of the model. Here our interest is in predicting, under a given loss or utility function, the values of future observations or of functions of them.

Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from a population having pdf (or pmf) $f(x \mid \theta)$ and suppose the future experiment produces an observation y, where $y \in \mathcal{Y}$, $\mathcal{Y}$ being the sample space of the future experiment. In this set-up, $\mathcal{Y}$ becomes the action space. Our decisions concerning y will be based on $g(y \mid x)$.

For a point prediction problem, as in the parameter estimation problem, the action set is $\mathcal{Y}$, the space of the future observation. With each prediction (or action) $a \in \mathcal{A}\ (= \mathcal{Y})$ and each realizable outcome y, we associate a loss $L(y, a)$. The Bayesian approach to statistical decision theory directs us to choose an optimum action $a^*$ which minimizes the expected predictive loss

$$L(a) = \int_{\mathcal{Y}} L(y, a)\, g(y \mid x)\, dy.$$

The Bayes prediction $a^*$ is such that

$$L(a^*) = \inf_{a \in \mathcal{A}} L(a). \tag{8.40}$$

The Bayes prediction depends on the observed data x.
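Example 8.19. (Example 8.1 continued) Under the linex loss function

$$L(y, \hat{y}) = e^{a(\hat{y} - y)} - a(\hat{y} - y) - 1, \tag{8.41}$$

the Bayes point predictor is

$$\begin{aligned}
\hat{y} &= -\frac{1}{a} \log E(e^{-aY} \mid x) = -\frac{1}{a} \left[-a\bar{x} + \frac{a^2 \sigma^2 (n+1)}{2n}\right] \\
&= \bar{x} - \frac{a \sigma^2 (n+1)}{2n}.
\end{aligned}$$

Remark 8.12. If $\sigma^2$ is unknown, Zellner (1986) suggests using $\sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)$ as an estimate of $\sigma^2$ to obtain an approximate point predictor.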

Example 8.20. (Zellner, 1988) Suppose Z = log Y has a predictive distribution N(m, v). If the loss function is the relative squared error loss function for point prediction $\hat{y}$, $L(y, \hat{y}) = (1 - \hat{y}/y)^2$, the Bayes point predictor is

$$\hat{y} = \frac{E(1/Y)}{E(1/Y^2)},$$

where the expectations are taken with respect to the predictive distribution $g(y \mid x)$. Since

$$E(Y^{-1}) = E(e^{-Z}) = \exp(-m + v/2)$$

and

$$E(Y^{-2}) = E(e^{-2Z}) = \exp(-2m + 2v),$$

we get

$$\hat{y} = \exp(m - 3v/2).$$

Note that the point predictor $\hat{y}$, under relative SELF, differs from the mean of the lognormal predictive pdf for y, which is $E(Y) = E(e^{Z}) = \exp(m + v/2)$ and which is optimal for SELF. The predictive median of Y, which is $e^{m}$, is optimal for an absolute error loss function.
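
The three competing point predictors can be compared through the expected relative squared error, which expands to $1 - 2a\,E(1/Y) + a^2 E(1/Y^2)$. The following is a small sketch under the lognormal assumption of the example; the function names and the values of m and v are illustrative.

```python
from math import exp


def relative_sel_predictor(m, v):
    """Bayes predictor E(1/Y) / E(1/Y^2) when log Y ~ N(m, v)."""
    e_inv = exp(-m + v / 2)         # E(1/Y)
    e_inv2 = exp(-2 * m + 2 * v)    # E(1/Y^2)
    return e_inv / e_inv2


def expected_relative_loss(a, m, v):
    """E(1 - a/Y)^2 = 1 - 2 a E(1/Y) + a^2 E(1/Y^2)."""
    return 1 - 2 * a * exp(-m + v / 2) + a * a * exp(-2 * m + 2 * v)


m, v = 0.7, 0.4
candidates = {
    "relative SELF": relative_sel_predictor(m, v),   # exp(m - 3v/2)
    "median": exp(m),
    "mean": exp(m + v / 2),
}
for name, a in candidates.items():
    print(name, round(a, 4), round(expected_relative_loss(a, m, v), 4))
```

The predictor $\exp(m - 3v/2)$ gives the smallest expected relative loss of the three, and it is smaller than both the predictive median and the predictive mean.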

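Example 8.21. (Bolfarine, 1987) Let $X_1, X_2, \ldots, X_N$ be independent and identically Bin$(1, \theta)$ distributed random variables, so that

$$P(X_i = 1) = \theta, \qquad P(X_i = 0) = 1 - \theta; \quad i = 1, \ldots, N;\ \theta \in [0, 1].$$

Let us denote the observed sample of n (< N) units by $x_s = (x_1, \ldots, x_n)$ and the unobserved part of the population, having (N − n) units, by $x_r = (x_{n+1}, \ldots, x_N)$. We are interested in estimating the population total $T = \sum_{i=1}^{N} x_i$. Since $T = n\bar{x}_s + (N - n)\bar{x}_r$, where $\bar{x}_s = \dfrac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar{x}_r = \dfrac{1}{N - n} \sum_{i=n+1}^{N} x_i$, the problem of predicting T is reduced to predicting $\bar{x}_r$, given $\bar{x}_s$.

Under the squared error loss function, the estimate of $\bar{x}_r$, given $\bar{x}_s$, is the predictive mean $E(\bar{x}_r \mid \bar{x}_s)$. Thus the Bayes predictor of T is

$$\hat{T} = n\bar{x}_s + (N - n)\, E(\bar{x}_r \mid \bar{x}_s). \tag{8.42}$$

Let us assume that the prior distribution of $\theta$ is Beta(a, b). Then

$$E(\bar{x}_r \mid \bar{x}_s) = E\left(E(\bar{x}_r \mid \bar{x}_s, \theta) \mid \bar{x}_s\right) = E(\theta \mid \bar{x}_s) = \frac{a + n\bar{x}_s}{a + b + n}.$$

So,

$$\hat{T} = n\bar{x}_s + (N - n) \left(\frac{a + n\bar{x}_s}{a + b + n}\right).$$

The predictive risk of $\hat{T}$ is

$$\begin{aligned}
R(\hat{T}, T) = E(\hat{T} - T)^2 &= E\left[n\bar{x}_s + (N - n)\bar{x}_r - \left(n\bar{x}_s + (N - n)\frac{a + n\bar{x}_s}{a + b + n}\right)\right]^2 \\
&= (N - n)^2\, E\left(\bar{x}_r - \frac{a + n\bar{x}_s}{a + b + n}\right)^2.
\end{aligned}$$

Since $E(\bar{x}_r^2) = \text{Var}(\bar{x}_r) + E^2(\bar{x}_r) = \dfrac{\theta(1 - \theta)}{N - n} + \theta^2$,

$$E(a + n\bar{x}_s)^2 = \text{Var}(n\bar{x}_s) + \left(a + nE(\bar{x}_s)\right)^2 = n\theta(1 - \theta) + (a + n\theta)^2,$$

and $\bar{x}_s$ and $\bar{x}_r$ are independent, we have

$$E\left((a + n\bar{x}_s)\bar{x}_r\right) = E(a\bar{x}_r + n\bar{x}_s\bar{x}_r) = a\theta + n\theta^2.$$

Note that the expectations are taken with respect to $f(x \mid \theta)$. On simplifying, we get

$$R(\hat{T}, T) = (N - n)^2 \left[\frac{n\theta(1 - \theta)}{(a + b + n)^2} + \frac{\theta(1 - \theta)}{N - n} + \frac{\left(a - \theta(a + b)\right)^2}{(a + b + n)^2}\right].$$

The Bayes prediction risk of $\hat{T}$ is

$$\begin{aligned}
r(\hat{T}) &= E\left(R(\hat{T}, T) \mid \bar{x}_s\right) \\
&= (N - n)^2 \left[\frac{(a + n\bar{x}_s)(b + n - n\bar{x}_s)}{(a + b + n)(a + b + n + 1)} \left\{\frac{n}{(a + b + n)^2} + \frac{1}{N - n} + \frac{(a + b)^2}{(a + b + n)^3}\right\} + \frac{n^2 \left(a - \bar{x}_s(a + b)\right)^2}{(a + b + n)^4}\right],
\end{aligned}$$

since

$$E(\theta \mid \bar{x}_s) = \frac{a + n\bar{x}_s}{a + b + n} \quad \text{and} \quad E(\theta(1 - \theta) \mid \bar{x}_s) = \frac{(a + n\bar{x}_s)(b + n - n\bar{x}_s)}{(a + b + n)(a + b + n + 1)},$$

and $E(\theta^2 \mid \bar{x}_s) = E(\theta \mid \bar{x}_s) - E(\theta(1 - \theta) \mid \bar{x}_s)$.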


Example 8.22. (Zacks, 1981) A certain commodity is stocked at the beginning of each day according to a policy determined by the following considerations.

The daily demand (in number of units) is a random variable X having pdf $f(x \mid \theta)$. Let $X_1, X_2, \ldots, X_n$ denote a sequence of iid random variables having pdf $f(x \mid \theta)$, which represent the observed demand on consecutive days. The stock level at the beginning of each day, $S_n$, n = 1, 2, ..., can be adjusted by increasing or decreasing the available stock at the end of the previous day.

If C(S, X) represents the cost of a wrong decision about adding or not adding to the available stock at the end of the previous day and $g(\theta)$ is the prior distribution of $\theta$, then the prior expected daily cost is

$$r(S, g) = \int_{\Theta} R(S, \theta)\, g(\theta)\, d\theta,$$

where $R(S, \theta) = E(C(S, X)) = \sum_{x=0}^{\infty} f(x \mid \theta)\, C(S, x)$.

Interchanging the order of integration and summation in r(S, g), we have

$$\begin{aligned}
r(S, g) &= \sum_{x=0}^{\infty} C(S, x) \int_{\Theta} f(x \mid \theta)\, g(\theta)\, d\theta \\
&= \sum_{x=0}^{\infty} C(S, x)\, m(x),
\end{aligned}$$

where m(x) is the marginal (unconditional or prior predictive) density function of X.

After observing the value $x_1$ of $X_1$, we may take the posterior pdf $g_1(\theta \mid x_1)$ in place of $g(\theta)$ and determine the predictive density

$$g(x_2 \mid x_1) = \int_{\Theta} f(x_2 \mid \theta)\, g_1(\theta \mid x_1)\, d\theta.$$

The expected cost for the second day is

$$\begin{aligned}
r(S_2, g) &= \sum_{x_2=0}^{\infty} C(S_2, x_2)\, m(x_2) \\
&= \sum_{x_2=0}^{\infty} C(S_2, x_2) \sum_{x_1=0}^{\infty} g(x_2 \mid x_1)\, m(x_1) \\
&= \sum_{x_1=0}^{\infty} m(x_1) \sum_{x_2=0}^{\infty} C(S_2, x_2)\, g(x_2 \mid x_1).
\end{aligned}$$

Note that the term $\sum_{x_2=0}^{\infty} C(S_2, x_2)\, g(x_2 \mid x_1)$ is the posterior expected cost, given $X_1 = x_1$.

Similarly, given the demands $x = (x_1, \ldots, x_n)$ on n days, the optimal stock level for the beginning of the (n+1)st day can be obtained by minimising the expected predictive loss. In particular, if our loss function C(S, X) is $a(s - x)^2$, a > 0, the optimal stock level will be the mean of the predictive distribution $g(x_{n+1} \mid x)$. If our inventory cost function is bilinear,

$$C(S, X) = \begin{cases} k_1(s - x) & \text{if } s > x, \\ k_2(x - s) & \text{if } x > s, \end{cases} \tag{8.43}$$

where $k_1$ (> 0) is the daily cost of holding a unit in stock and $k_2$ (> 0) is the penalty for the shortage of a unit, then the optimal stock level on the (n+1)st day will be the $\dfrac{k_2}{k_1 + k_2}$th fractile of $g(x_{n+1} \mid x)$.

In particular, if the demand on a particular day follows the Poisson distribution with unknown mean $\theta$ and the prior distribution of $\theta$ is $g(\theta) \propto 1/\theta$, then the predictive pmf of the future demand after n days is the negative binomial $\text{NBin}\left(\sum_{i=1}^{n} x_i,\ \dfrac{n}{n+1}\right)$, and the Bayes estimate of the demand, under SELF, is $\sum_{i=1}^{n} x_i / n$.

However, under linex loss, the Bayes estimate becomes $\dfrac{1}{a} \sum_{i=1}^{n} x_i\, \log\left(1 + \dfrac{1 - e^{-a}}{n}\right)$, where a (≠ 0) is the shape parameter of the loss function.
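
The fractile rule for the bilinear cost (8.43) can be illustrated for Poisson demand. The sketch below assumes the negative binomial predictive distribution stated above, written with "success probability" n/(n+1), and requires a positive total demand; the function names and the numbers are hypothetical.

```python
from math import exp, lgamma, log


def neg_binomial_pmf(y, t, n):
    """Predictive P(next demand = y) after n days with total observed demand t > 0."""
    return exp(lgamma(t + y) - lgamma(t) - lgamma(y + 1)
               + t * log(n / (n + 1)) - y * log(n + 1))


def optimal_stock(t, n, k1, k2):
    """Smallest s whose predictive cdf reaches the fractile k2 / (k1 + k2)."""
    target = k2 / (k1 + k2)
    cdf, s = 0.0, 0
    while True:
        cdf += neg_binomial_pmf(s, t, n)
        if cdf >= target:
            return s
        s += 1


def expected_cost(s, t, n, k1, k2, y_max=400):
    """Expected bilinear cost of stocking s units (sum truncated at y_max)."""
    return sum((k1 * (s - y) if s > y else k2 * (y - s)) * neg_binomial_pmf(y, t, n)
               for y in range(y_max))


# Hypothetical record: total demand 37 over 10 days, holding cost 1, shortage cost 4.
s_star = optimal_stock(37, 10, 1.0, 4.0)
print(s_star, [round(expected_cost(s, 37, 10, 1.0, 4.0), 4)
               for s in (s_star - 1, s_star, s_star + 1)])
```

The expected cost at the fractile is no larger than at the neighbouring stock levels, which is the discrete analogue of the first-order condition behind the $k_2/(k_1 + k_2)$ rule.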
8.9 BAYES PREDICTION WITH INDUCED LOSS 


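The posterior predictive expected loss is

$$\begin{aligned}
L(a) &= E(L(a, y) \mid x) \\
&= \int_{\mathcal{Y}} L(a, y) \left[\int_{\Theta} f(y \mid \theta)\, g(\theta \mid x)\, d\theta\right] dy.
\end{aligned}$$

On interchanging the order of integration, we have

$$\begin{aligned}
L(a) &= \int_{\Theta} \left[\int_{\mathcal{Y}} L(a, y)\, f(y \mid \theta)\, dy\right] g(\theta \mid x)\, d\theta \\
&= \int_{\Theta} M(a, \theta)\, g(\theta \mid x)\, d\theta,
\end{aligned} \tag{8.44}$$

where

$$M(a, \theta) = \int_{\mathcal{Y}} L(a, y)\, f(y \mid \theta)\, dy. \tag{8.45}$$

The function $M(a, \theta)$ is the expected loss with respect to the density $f(y \mid \theta)$ of the future observation Y. Aitchison and Dunsmore (1975) called it the "induced loss". Thus L(a) can be considered as the posterior expected induced loss. The point predictor a can be obtained by minimising $E(M(a, \theta) \mid x)$ with respect to a.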
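Example 8.23. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from Pois$(\theta)$ and $X_{n+1}$ is an independent observation from the same distribution. Let $L(y, a) = (y - a)^2$ and let the prior distribution of $\theta$ be $g(\theta) \propto 1/\theta$. The induced loss function is

$$M(a, \theta) = \sum_{y=0}^{\infty} (y - a)^2\, \frac{e^{-\theta} \theta^{y}}{y!} = \theta + (\theta - a)^2,$$

and the posterior predictive expected loss is

$$L(a) = \int_{\Theta} \left(\theta + (\theta - a)^2\right) g(\theta \mid x)\, d\theta = E\left(\theta + (\theta - a)^2 \mid x\right).$$

Differentiating with respect to a and equating to zero,

$$\frac{\partial L(a)}{\partial a} = -2E\left((\theta - a) \mid x\right) = 0,$$

so that

$$a = E(\theta \mid x) = \sum_{i=1}^{n} x_i / n = \bar{x}.$$

Hence, the point predictor of y is $\bar{x}$.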
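Example 8.24. (Williford and Bingham, 1979) Suppose $X = (X_1, X_2, \ldots, X_N)$ is a random sample from the zero-inflated Poisson distribution having pmf

$$\begin{aligned}
P(X = 0) &= \omega + (1 - \omega) e^{-\theta}, \\
P(X = j) &= (1 - \omega)\, \frac{e^{-\theta} \theta^{j}}{j!}, \quad j = 1, 2, \ldots,
\end{aligned} \tag{8.46}$$

where $\theta > 0$ and $0 < \omega < 1$. Assume that $\theta$ and $\omega$ are a-priori independent with joint prior distribution

$$g(\theta, \omega) \propto 1/\theta.$$

The likelihood function of $(\theta, \omega)$ is given by

$$\begin{aligned}
\ell(\theta, \omega \mid x) &= \left(\omega + (1 - \omega) e^{-\theta}\right)^{n_0} (1 - \omega)^{N - n_0}\, e^{-\theta(N - n_0)}\, \theta^{t} \Big/ \prod_{i=1}^{N} x_i! \\
&= \theta^{t} \sum_{j=0}^{n_0} \binom{n_0}{j} \omega^{j} (1 - \omega)^{N - j} e^{-\theta(N - j)} \Big/ \prod_{i=1}^{N} x_i!,
\end{aligned}$$

where $x = (x_1, \ldots, x_N)$, t is the sum of the $N - n_0$ non-zero observations (equivalently, $t = \sum_{i=1}^{N} x_i$), and $n_i$ is the number of observations in the ith class.

The joint posterior distribution of $\theta$ and $\omega$ is

$$g(\theta, \omega \mid x) = \frac{\sum_{j=0}^{n_0} \binom{n_0}{j} \omega^{j} (1 - \omega)^{N - j}\, \theta^{t - 1} e^{-\theta(N - j)}}{\sum_{j=0}^{n_0} \binom{n_0}{j} B(j + 1,\ N - j + 1)\, \Gamma(t)\, (N - j)^{-t}}. \tag{8.47}$$

Suppose Y is a future independent observation from the zero-inflated Poisson distribution (8.46). If $L(a, y) = (y - a)^2$ is the loss function, then the induced loss function is

$$\begin{aligned}
M(a, \theta, \omega) &= \sum_{y=0}^{\infty} (y - a)^2 f(y \mid \theta, \omega) \\
&= a^2 \left(\omega + (1 - \omega) e^{-\theta}\right) + \sum_{y=1}^{\infty} (y - a)^2 (1 - \omega)\, e^{-\theta} \theta^{y} / y! \\
&= a^2 \omega + \sum_{y=0}^{\infty} (y - a)^2 (1 - \omega)\, e^{-\theta} \theta^{y} / y! \\
&= a^2 \omega + (1 - \omega) \left(\theta + (\theta - a)^2\right),
\end{aligned} \tag{8.48}$$

and the posterior predictive expected loss is

$$\begin{aligned}
L(a) &= \iint M(a, \theta, \omega)\, g(\theta, \omega \mid x)\, d\theta\, d\omega \\
&= E\left\{\left(\theta + (\theta - a)^2\right)(1 - \omega) + a^2 \omega\right\},
\end{aligned} \tag{8.49}$$

where the expectation is taken with respect to the posterior distribution $g(\theta, \omega \mid x)$. Differentiating with respect to a and equating to zero, we get

$$\frac{\partial}{\partial a} L(a) = E\left[2a\omega - 2(1 - \omega)(\theta - a)\right] = 0.$$

Thus, the point predictor of the future observation y, under the squared error loss function, is

$$a = E(\theta(1 - \omega) \mid x) = \frac{t \sum_{j=0}^{n_0} \binom{n_0}{j} B(j + 1,\ N - j + 2)\, (N - j)^{-(t + 1)}}{\sum_{j=0}^{n_0} \binom{n_0}{j} B(j + 1,\ N - j + 1)\, (N - j)^{-t}}.$$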


Chapter 9 


Bayesian Inference for the Linear Model 


The linear model may be considered as an equation, involving random variables, mathematical variables, and parameters, that is linear in the parameters. It may be expressed as

$$y = X\beta + u,$$

where X is the n×k model matrix of observed data, $\beta$ is the k×1 column vector of unknown coefficients to be estimated, $X\beta$ is called the linear structure vector, and u is the column vector of independent error terms with zero mean. The error term may represent specification error and/or measurement error. In order to perform Bayesian analyses of a linear model we shall further assume that the error vector u is MVN$(0, \sigma^2 V)$, where V is a positive definite symmetric matrix. If V = I, we call it a homoscedastic model, otherwise heteroscedastic.

The linear model includes sequence of random variables, simple, and multiple regression models 
besides models for designed experiments. 

In this chapter, a detailed Bayesian analysis of the linear regression model is provided. In 
particular, we shall illustrate results for simple univariate linear regression model when the disturbance 
term is either homoscedastic or heteroscedastic. Section 9.3 deals with Bayesian predictive analysis of 
the simple linear regression models discussed in earlier sections. Sections 9.4 and 9.5 present some 
interesting examples in Bayes estimation and prediction in a regression model. Section 9.6 gives 
procedures for testing and comparing hypotheses concerning the regression coefficient. Simple control 
problem is discussed in Section 9.7 and results for the general linear model are presented in the last 
section of the chapter. 

The reader may consult Raiffa and Schlaifer (1961), Box and Tiao (1973), Zellner (1971), Leamer 
(1978) and Broemeling (1985) for more details and other linear models. 


9.1 HOMOSCEDASTIC DISTURBANCES 


Let us consider a simple univariate normal linear regression model 
y,=Bx;+u,, i=1,2,..,n, O.1) 


without the intercept term, where Y,> Xs and u, are the ith observation on the dependent variable, 
independent variable, and the unobserved value of the random disturbance, respectively, and B is the 
unknown regression parameter. We shall adopt assumptions of normality, independence, linearity, and 
absence of measurement errors in the following discussion. 
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The relationship in equation (9.1) is linear in B and u, and, therefore, it is a linear regression model. 
Let us assume that each u, is N(O, 07) distributed and errors u,, U,, -.U, are independent of each other. 
Let us further assume that Xj, X55 vy X, are fixed non-stochastic variables. 


The likelihood function based on n observations z= {(x,,y,):i=1,2,....n} is 


«B.olz)=T] f(y; |B.) 


co exp| (y, —Bx,)° jps'] 


i=l 


«ot ox —1)6° +6-6); x; | fae : (9.2) 


i=l 


where p=) XY; y x. is the ordinary least squares estimate (OLSE) of B and 
i=l i=l 


n 


(n-)6? =)" (y, -Bx,)’, 


i=l 
Case 1: $\sigma$ known 


(a) Non-informative prior for $\beta$ 
Let us take Jeffreys' non-informative prior for $\beta$ 

\[ g(\beta) \propto 1, \qquad \beta \in (-\infty, \infty). \]

The posterior distribution of $\beta$ works out to be 

\[ g(\beta \mid z) \propto \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} x_i^2 \, (\beta - \hat\beta)^2 \right], \tag{9.3} \]

since the first term in the exponential part of equation (9.2) is a function of z only and is, therefore, 
absorbed in the constant of proportionality. It is easy to recognize that the posterior distribution of $\beta$ is 
$N\left( \hat\beta, \; \sigma^2 \left( \sum_{i=1}^{n} x_i^2 \right)^{-1} \right)$. 
We may note that the posterior mean, as well as the median and mode, is equal to the OLSE $\hat\beta$, 
and the posterior variance is $\sigma^2 \left( \sum_{i=1}^{n} x_i^2 \right)^{-1}$, which is also the variance of $\hat\beta$. 


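(b) Conjugate prior for $\beta$ 
Let us write the likelihood function (9.2) as 

\[ \ell(\beta, \sigma \mid z) \propto \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} x_i^2 \, (\beta - \hat\beta)^2 \right]. \]

Therefore, an appropriate conjugate prior for $\beta$ is $N(\beta_0, \sigma^2/m)$, written as 

\[ g(\beta) \propto \exp\left[ -\frac{m}{2\sigma^2} (\beta - \beta_0)^2 \right], \qquad m > 0, \; \beta_0 \in (-\infty, \infty), \tag{9.4} \]

with hyperparameters m and $\beta_0$. The posterior density of $\beta$ is 

\[
g(\beta \mid z) \propto \exp\left[ -\frac{1}{2\sigma^2} \left\{ (\beta - \hat\beta)^2 \sum_{i=1}^{n} x_i^2 + m(\beta - \beta_0)^2 \right\} \right]
\propto \exp\left[ -\frac{1}{2\sigma^2} \left( \sum_{i=1}^{n} x_i^2 + m \right) (\beta - \beta^*)^2 \right],
\tag{9.5}
\]

since $(\beta - \hat\beta)^2 \sum_{i=1}^{n} x_i^2 + (\beta - \beta_0)^2 m = \left( \sum_{i=1}^{n} x_i^2 + m \right)(\beta - \beta^*)^2$ + term independent of $\beta$, where 

\[ \beta^* = \left( \hat\beta \sum_{i=1}^{n} x_i^2 + m\beta_0 \right) \Big/ \left( \sum_{i=1}^{n} x_i^2 + m \right). \]

Hence the posterior density of $\beta$ is $N\left( \beta^*, \; \sigma^2 \left( \sum_{i=1}^{n} x_i^2 + m \right)^{-1} \right)$. 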
i=l i=l 


Remark 9.1. If we assume that the precision of the prior distribution tends to zero, that is, $m \to 0$, the 
prior distribution tends to the non-informative prior $g(\beta) \propto$ constant, and the corresponding posterior 
mean $\beta^*$ tends to $\hat\beta$ and the posterior precision tends to $\sum_{i=1}^{n} x_i^2 / \sigma^2$, as in Case 1(a). 


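Remark 9.2. Let us write 

\[ \beta^* = \frac{\sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} x_i^2 + m} \, \hat\beta + \frac{m}{\sum_{i=1}^{n} x_i^2 + m} \, \beta_0, \]

which is a weighted average of $\hat\beta$ and $\beta_0$, so that $\beta^*$ is a compromise between the prior mean $\beta_0$ 
and the least squares estimate $\hat\beta$ of $\beta$. 

The weights are directly proportional to their respective precisions. 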
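
The update in (9.5) and the weighted-average form of Remark 9.2 can be sketched numerically as follows. This is only an illustrative sketch: the function name `conjugate_posterior`, the data and the hyperparameter values are assumptions made for the example, not part of the text.

```python
def conjugate_posterior(x, y, sigma2, beta0, m):
    """Posterior N(beta_star, var_star) of beta for y_i = beta*x_i + u_i
    with known error variance sigma2 and prior N(beta0, sigma2/m)."""
    sxx = sum(xi * xi for xi in x)               # sum of x_i^2
    sxy = sum(xi * yi for xi, yi in zip(x, y))   # sum of x_i*y_i
    beta_hat = sxy / sxx                         # OLSE of beta
    # Weighted average of the OLSE and the prior mean (Remark 9.2)
    beta_star = (sxx * beta_hat + m * beta0) / (sxx + m)
    var_star = sigma2 / (sxx + m)                # posterior variance
    return beta_hat, beta_star, var_star


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
beta_hat, beta_star, var_star = conjugate_posterior(x, y, sigma2=1.0, beta0=0.0, m=2.0)
print(beta_hat, beta_star, var_star)
```

Calling the function with a very small m reproduces the non-informative case of Remark 9.1, where the posterior mean approaches the OLSE.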
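Case 2: $\beta$ and $\sigma$ both unknown 


(a) Non-informative prior for $\beta$ and $\sigma$ 
Let us assume that $\beta$ and $\sigma$ are a-priori independent. The Jeffreys' non-informative prior for $\sigma$ is 

\[ g(\sigma) \propto \frac{1}{\sigma}, \qquad \sigma \in (0, \infty), \]

and that for $\beta$ is 

\[ g(\beta) \propto 1, \]

so that the joint non-informative prior distribution of $(\beta, \sigma)$ is 

\[ g(\beta, \sigma) \propto \frac{1}{\sigma}, \qquad \beta \in (-\infty, \infty), \; \sigma > 0. \tag{9.6} \]

Thus, the posterior distribution of $(\beta, \sigma)$ is 

\[
\begin{aligned}
g(\beta, \sigma \mid z) &\propto \ell(\beta, \sigma \mid z) \, g(\beta, \sigma) \\
&\propto \sigma^{-(n+1)} \exp\left[ -\frac{1}{2\sigma^2} \left\{ (n-1)\hat\sigma^2 + (\beta - \hat\beta)^2 \sum_{i=1}^{n} x_i^2 \right\} \right] \\
&= \left[ \frac{1}{\sigma} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} x_i^2 \, (\beta - \hat\beta)^2 \right\} \right]
\left[ \frac{1}{\sigma^{n}} \exp\left\{ -\frac{1}{2\sigma^2} (n-1)\hat\sigma^2 \right\} \right].
\end{aligned}
\tag{9.7}
\]

The first term on the right hand side is a kernel of $N\left( \hat\beta, \; \sigma^2 \left( \sum_{i=1}^{n} x_i^2 \right)^{-1} \right)$ and the second term is that 
of Inverted-Gamma$\left( (n-1)/2, \; (n-1)\hat\sigma^2/2 \right)$. 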


Remark 9.3. We may express the posterior distribution $g(\beta, \sigma \mid z)$ as a product of the conditional 
posterior distribution of $\beta$, given $\sigma$, and the marginal posterior distribution of $\sigma$. It is interesting to 
observe that the posterior distribution of $(\beta, \sigma)$ loses the assumed prior independence of $\beta$ and $\sigma$. 


Remark 9.4. The posterior distribution of $(\beta, \sigma)$ is known as the normal-inverted gamma distribution with 
parameters $\hat\beta$, $\sigma^2 \left( \sum_{i=1}^{n} x_i^2 \right)^{-1}$, $(n-1)/2$, and $(n-1)\hat\sigma^2/2$. Let us recall that a similar result was obtained 
when we derived the posterior distribution of $(\theta, \sigma^2)$ of the normal distribution in Chapter 4, Remark 
4.22. 


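(b) Conjugate prior for $\beta$ and $\sigma$ 
Remark 9.3 suggests that the conjugate prior for $(\beta, \sigma)$ should be chosen such that it is a product of 
the conditional prior of $\beta$, given $\sigma$, as a normal distribution and the marginal prior for $\sigma$ as an 
inverted gamma. 

Sometimes it is mathematically convenient and instructive to work with precision instead of 
variance. Let us assume that the sample z of size n is drawn from the simple linear regression model 
(9.1) where both the regression coefficient $\beta$ and the precision $r \, (= 1/\sigma^2)$ are unknown. The likelihood 
function of $\beta$ and r given in (9.2) modifies to 

\[ \ell(\beta, r \mid z) \propto r^{n/2} \exp\left[ -\frac{r}{2} \left\{ (n-1)\hat\sigma^2 + (\beta - \hat\beta)^2 \sum_{i=1}^{n} x_i^2 \right\} \right]. \tag{9.8} \]

The natural conjugate prior for $(\beta, r)$ is such that 

\[ g(\beta, r) = g(\beta \mid r) \, g(r), \tag{9.9} \]

where $g(\beta \mid r)$ is $N(\beta_0, mr)$, the second argument being the precision, and $g(r)$ is Gamma(u, v), so that 

\[ g(\beta, r) \propto r^{u - 1/2} \exp\left[ -r \left\{ v + \frac{m}{2} (\beta - \beta_0)^2 \right\} \right]. \]

The posterior distribution is 

\[
\begin{aligned}
g(\beta, r \mid z) &\propto r^{(n + 2u - 1)/2} \exp\left[ -r \left\{ \frac{(n-1)\hat\sigma^2}{2} + \frac{(\beta - \hat\beta)^2}{2} \sum_{i=1}^{n} x_i^2 + v + \frac{m}{2} (\beta - \beta_0)^2 \right\} \right] \\
&= \left[ r^{1/2} \exp\left\{ -\frac{r}{2} \left( \sum_{i=1}^{n} x_i^2 + m \right) (\beta - \beta^*)^2 \right\} \right]
\left[ r^{n/2 + u - 1} \exp(-r v_1) \right],
\end{aligned}
\tag{9.10}
\]

where, on completing the square in $\beta$, 

\[ v_1 = v + \frac{1}{2} \left[ (n-1)\hat\sigma^2 + \frac{m \sum_{i=1}^{n} x_i^2}{m + \sum_{i=1}^{n} x_i^2} (\hat\beta - \beta_0)^2 \right]. \]

Thus, the joint posterior distribution of $\beta$ and r is a product of the conditional posterior distribution of $\beta$, 
given r, as $N\left( \beta^*, \; r \left( \sum_{i=1}^{n} x_i^2 + m \right) \right)$ and the marginal posterior distribution of r as Gamma$(n/2 + u, \; v_1)$. The 
marginal posterior distribution of $\beta$ may now be obtained by integrating out r from $g(\beta, r \mid z)$, so that 
we have 

\[
g(\beta \mid z) = \int_0^\infty g(\beta, r \mid z) \, dr
\propto \left[ v_1 + \frac{(\beta - \beta^*)^2}{2} \left( \sum_{i=1}^{n} x_i^2 + m \right) \right]^{-\frac{n + 2u + 1}{2}}
\propto \left[ 1 + \frac{(\beta - \beta^*)^2}{2 v_1} \left( \sum_{i=1}^{n} x_i^2 + m \right) \right]^{-\frac{n + 2u + 1}{2}},
\tag{9.11}
\]

which is a kernel of a 3-parameter t-density with (n+2u) df, location parameter $\beta^*$, and scale parameter 

\[ \frac{n + 2u}{2 v_1} \left( m + \sum_{i=1}^{n} x_i^2 \right). \]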
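
A small numerical sketch of the normal-gamma update leading to (9.10) and (9.11) is given below. It assumes that the quantity $v_1$ is obtained by completing the square in $\beta$, that is, $v_1 = v + \frac{1}{2}[(n-1)\hat\sigma^2 + m\sum x_i^2 (\hat\beta - \beta_0)^2/(m + \sum x_i^2)]$; the function name, data and hyperparameter values are illustrative assumptions.

```python
def normal_gamma_posterior(x, y, beta0, m, u, v):
    """Parameters of the marginal posterior t-density of beta, cf. (9.11).

    Returns (beta_star, v1, df, scale); scale is the precision-type
    scale parameter (n + 2u)(m + sum x_i^2) / (2 v1).
    """
    n = len(x)
    sxx = sum(xi * xi for xi in x)
    beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    rss = sum((yi - beta_hat * xi) ** 2 for xi, yi in zip(x, y))  # (n-1)*sigma_hat^2
    beta_star = (sxx * beta_hat + m * beta0) / (sxx + m)
    # Assumed form of v1, from completing the square in beta
    v1 = v + 0.5 * (rss + m * sxx / (sxx + m) * (beta_hat - beta0) ** 2)
    df = n + 2 * u
    scale = df * (sxx + m) / (2.0 * v1)
    return beta_star, v1, df, scale


# Hypothetical data; the limiting values m = 0, u = -1/2, v = 0 mimic
# the non-informative prior g(beta, r) proportional to 1/r.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
beta_star, v1, df, scale = normal_gamma_posterior(x, y, beta0=0.0, m=0.0, u=-0.5, v=0.0)
print(beta_star, df, scale)
```

In this limiting case the degrees of freedom become n − 1, the location becomes the OLSE, and the scale becomes $\sum x_i^2 / \hat\sigma^2$.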


Remark 9.5. If we let $u \to -1/2$, $v \to 0$ and $m \to 0$ in the prior density of $(\beta, r)$, we have the non- 
informative prior $g(\beta, r) \propto 1/r$. 

In order to obtain the limiting posterior distribution such that $g(\beta \mid r)$ is $N\left( \hat\beta, \; r \sum_{i=1}^{n} x_i^2 \right)$ and the 
marginal distribution of r is Gamma$\left( \frac{n-1}{2}, \; \frac{(n-1)\hat\sigma^2}{2} \right)$, we must violate the condition u > 0 and u must 
now approach a negative number (−1/2). (See DeGroot (1970), page 195). 

Remark 9.6. The posterior variance of $\beta$ is finite only when n > 3. 

9.2. HETEROSCEDASTIC DISTURBANCES 


The problem of heteroscedastic disturbances is encountered quite often in cross-sectional data. 
For example, when we consider family budgets, the residuals from the regression are found to have 
variance increasing with household income. In the simple linear regression model with no constant term, 
we assume $\mathrm{Var}(u_i) = \sigma_i^2$, $i = 1, 2, \ldots, n$, to represent heteroscedasticity in the model. Suppose we write 
$\sigma_i^2 = \sigma^2 \lambda_i^2$, where the $\lambda_i$'s are known; then equation (9.1) may be reduced to 

\[ \frac{y_i}{\lambda_i} = \beta \frac{x_i}{\lambda_i} + v_i, \tag{9.12} \]

where $v_i = u_i / \lambda_i$ has a constant variance $\sigma^2$. The weighted least squares estimator of $\beta$ from 
equation (9.12) is 

\[ \hat\beta = \sum_{i=1}^{n} \frac{x_i y_i}{\lambda_i^2} \Big/ \sum_{i=1}^{n} \frac{x_i^2}{\lambda_i^2}, \qquad \text{with} \quad \mathrm{Var}(\hat\beta) = \sigma^2 \left[ \sum_{i=1}^{n} \frac{x_i^2}{\lambda_i^2} \right]^{-1}. \tag{9.13} \]



If the variances are known up to a multiplicative constant, there is no difficulty in analysing the model. 
However, often $\mathrm{Var}(u_i)$ is proportional to a function of the independent variable $x_i$. In particular, if 
$\mathrm{Var}(u_i) = \sigma^2 x_i$, the weighted least squares estimator (9.13) of $\beta$ reduces to $\bar y / \bar x$. 
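Let us study the posterior distribution of $\beta$ when heteroscedasticity is present in the model. For 
mathematical convenience, consider $r_i = 1/\sigma_i^2 = r w_i$, $i = 1, 2, \ldots, n$, and $r = 1/\sigma^2$. The likelihood function 
of $(\beta, r)$ is 

\[
\ell(\beta, r \mid z) \propto r^{n/2} \exp\left[ -\frac{r}{2} \sum_{i=1}^{n} w_i (y_i - \beta x_i)^2 \right]
= r^{n/2} \exp\left[ -\frac{r}{2} \left\{ \sum_{i=1}^{n} w_i (y_i - \hat\beta_w x_i)^2 + (\beta - \hat\beta_w)^2 \sum_{i=1}^{n} w_i x_i^2 \right\} \right],
\tag{9.14}
\]

where $\hat\beta_w = \sum_{i=1}^{n} w_i x_i y_i \big/ \sum_{i=1}^{n} w_i x_i^2$ is the weighted least squares estimate of $\beta$. 
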


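Consider the normal-gamma prior for $(\beta, r)$ such that $g(\beta \mid r)$ is $N(\beta_0, mr)$ and $g(r)$ is Gamma(u, v), 
so that 

\[ g(\beta, r) \propto r^{u - 1/2} \exp\left[ -r \left\{ \frac{m}{2} (\beta - \beta_0)^2 + v \right\} \right]. \]

Since 

\[ \sum_{i=1}^{n} w_i x_i^2 (\beta - \hat\beta_w)^2 + m(\beta - \beta_0)^2 = \left( \sum_{i=1}^{n} w_i x_i^2 + m \right) (\beta - \beta^*)^2 + \frac{m \sum_{i=1}^{n} w_i x_i^2}{m + \sum_{i=1}^{n} w_i x_i^2} (\hat\beta_w - \beta_0)^2, \]

we have 

\[ g(\beta, r \mid z) \propto \left[ r^{1/2} \exp\left\{ -\frac{r}{2} \left( m + \sum_{i=1}^{n} w_i x_i^2 \right) (\beta - \beta^*)^2 \right\} \right] \left[ r^{n/2 + u - 1} \exp(-r v_1) \right], \tag{9.15} \]

where 

\[ \beta^* = \frac{\hat\beta_w \sum_{i=1}^{n} w_i x_i^2 + m\beta_0}{m + \sum_{i=1}^{n} w_i x_i^2}, \qquad
v_1 = v + \frac{1}{2} \left[ \sum_{i=1}^{n} w_i (y_i - \hat\beta_w x_i)^2 + \frac{m \sum_{i=1}^{n} w_i x_i^2}{m + \sum_{i=1}^{n} w_i x_i^2} (\hat\beta_w - \beta_0)^2 \right]. \]

On integrating out r from the posterior density of $(\beta, r)$, we have 

\[ g(\beta \mid z) \propto \left[ 1 + \frac{m + \sum_{i=1}^{n} w_i x_i^2}{2 v_1} (\beta - \beta^*)^2 \right]^{-\frac{n + 2u + 1}{2}}, \tag{9.16} \]

which is a kernel of a 3-parameter t-density with (n+2u) df, location parameter $\beta^*$, and the scale 
parameter $(n + 2u) \left( m + \sum_{i=1}^{n} w_i x_i^2 \right) \big/ (2 v_1)$. 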


Remark 9.7. For $w_i = 1$ for all i, the model reduces to the homoscedastic model and we get the results 
of Case 2(b). 


Remark 9.8. Let $w_i = 1/x_i$, that is, $\sigma_i^2 = \sigma^2 x_i$; we have 

\[ \hat\beta_w = \bar y / \bar x, \qquad \beta^* = (n\bar y + m\beta_0) / (m + n\bar x). \]

In particular, for a large sample size n or small prior precision ($m \to 0$), $\beta^* \approx \hat\beta_w = \bar y / \bar x$. 

As before, if we let $m \to 0$, $u \to -1/2$ and $v \to 0$, the posterior distribution of $\beta$ is a 3-parameter 
t-distribution with (n−1) df, location parameter $\hat\beta_w = \bar y / \bar x$, and scale parameter 

\[ (n-1) \left[ \frac{1}{n\bar x} \sum_{i=1}^{n} \frac{y_i^2}{x_i} - \left( \frac{\bar y}{\bar x} \right)^2 \right]^{-1}. \]

Hence, the 100(1−α)% HPD credible interval for $\beta$ will be 

\[ \frac{\bar y}{\bar x} \mp t_{\alpha/2}(n-1) \left[ \frac{1}{n-1} \left\{ \frac{1}{n\bar x} \sum_{i=1}^{n} \frac{y_i^2}{x_i} - \left( \frac{\bar y}{\bar x} \right)^2 \right\} \right]^{1/2}. \]

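Remark 9.9. If $w_i = 1/x_i^2$ then $\hat\beta_w = \frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{x_i}$ and $\beta^* = (n\hat\beta_w + m\beta_0)/(n + m)$. In particular, for 
$n \to \infty$, or $m \to 0$, $\beta^* = \hat\beta_w = \frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{x_i}$. Further, if $u \to -1/2$ and $v \to 0$, the posterior distribution of $\beta$ is a 
3-parameter t-distribution with (n−1) df, location parameter $\hat\beta_w$, and scale parameter 

\[ n(n-1) \left[ \sum_{i=1}^{n} \left( \frac{y_i}{x_i} \right)^2 - \frac{1}{n} \left( \sum_{i=1}^{n} \frac{y_i}{x_i} \right)^2 \right]^{-1}. \]

Hence, the 100(1−α)% HPD credible interval for $\beta$ is 

\[ \frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{x_i} \mp t_{\alpha/2}(n-1) \left[ \frac{1}{n(n-1)} \left\{ \sum_{i=1}^{n} \left( \frac{y_i}{x_i} \right)^2 - \frac{1}{n} \left( \sum_{i=1}^{n} \frac{y_i}{x_i} \right)^2 \right\} \right]^{1/2}. \]


9.3. PREDICTIVE DISTRIBUTION 
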


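Homoscedastic Model 


Case 1: Known precision 
Under the assumptions of Case 1(b) of Section 9.1, the predictive distribution of a future 
independent observation $y_{n+1}$, given $x_{n+1}$ and z, is 

\[
g(y_{n+1} \mid x_{n+1}, z) = \int_{-\infty}^{\infty} f(y_{n+1} \mid \beta, x_{n+1}) \, g(\beta \mid z) \, d\beta
\propto \int_{-\infty}^{\infty} \exp\left[ -\frac{1}{2\sigma^2} \left\{ (y_{n+1} - \beta x_{n+1})^2 + \left( \sum_{i=1}^{n} x_i^2 + m \right) (\beta - \beta^*)^2 \right\} \right] d\beta,
\]

which is $N\left( \beta^* x_{n+1}, \; \sigma^2 \, \frac{\sum_{i=1}^{n} x_i^2 + x_{n+1}^2 + m}{\sum_{i=1}^{n} x_i^2 + m} \right)$. 

Hence, the 95% highest predictive credible interval of $y_{n+1}$ is 

\[
\left[ \beta^* x_{n+1} - 1.96\,\sigma \left( \frac{\sum_{i=1}^{n} x_i^2 + x_{n+1}^2 + m}{\sum_{i=1}^{n} x_i^2 + m} \right)^{1/2}, \;\;
\beta^* x_{n+1} + 1.96\,\sigma \left( \frac{\sum_{i=1}^{n} x_i^2 + x_{n+1}^2 + m}{\sum_{i=1}^{n} x_i^2 + m} \right)^{1/2} \right],
\tag{9.17}
\]

when $\sigma^2$ is assumed to be known. 
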


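Remark 9.10. We may use the method of iterative expectations to find the mean and variance of the 
predictive distribution. The predictive mean of the future independent observation $y_{n+1}$ is 

\[ E(y_{n+1} \mid z) = E\left( E(y_{n+1} \mid \beta, z) \mid z \right) = x_{n+1} E(\beta \mid z) = \beta^* x_{n+1}. \]

The predictive variance of $y_{n+1}$ is given by 

\[ \mathrm{Var}(y_{n+1} \mid z) = \mathrm{Var}\left( E(y_{n+1} \mid \beta, z) \mid z \right) + E\left( \mathrm{Var}(y_{n+1} \mid \beta, z) \mid z \right) = x_{n+1}^2 \, \mathrm{Var}(\beta \mid z) + \sigma^2. \]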
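
The predictive mean and variance of Remark 9.10, together with the interval (9.17), can be sketched as below. The helper name `predictive_interval`, the data and the default normal quantile 1.96 are assumptions made for illustration.

```python
import math


def predictive_interval(x, y, x_new, sigma2, beta0, m, z=1.96):
    """Predictive mean, variance and credible interval for y_{n+1}
    under known sigma2 and the conjugate prior N(beta0, sigma2/m)."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    beta_star = (sxy + m * beta0) / (sxx + m)   # posterior mean of beta
    post_var = sigma2 / (sxx + m)               # posterior variance of beta
    mean = beta_star * x_new
    # Iterated-variance decomposition of Remark 9.10
    var = x_new ** 2 * post_var + sigma2
    half = z * math.sqrt(var)
    return mean, var, (mean - half, mean + half)


# Hypothetical data and design point, for illustration only
mean, var, (lower, upper) = predictive_interval(
    [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8],
    x_new=5.0, sigma2=1.0, beta0=0.0, m=2.0)
print(mean, var, lower, upper)
```

The variance returned here equals $\sigma^2(\sum x_i^2 + x_{n+1}^2 + m)/(\sum x_i^2 + m)$, the form that appears in (9.17).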


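Case 2: Both regression parameter and precision unknown 
Under the assumptions of Case 2(b) of Section 9.1, the predictive density of a future independent 
observation $y_{n+1}$, given $x_{n+1}$ and z, is 

\[
\begin{aligned}
g(y_{n+1} \mid x_{n+1}, z) &= \int_0^\infty \int_{-\infty}^{\infty} g(\beta, r \mid z) \, f(y_{n+1} \mid x_{n+1}, \beta, r) \, d\beta \, dr \\
&\propto \int_0^\infty r^{\frac{n + 2u + 1}{2} - 1} \exp\left[ -r \left\{ v_1 + \frac{1}{2} \, \frac{m + \sum_{i=1}^{n} x_i^2}{m + \sum_{i=1}^{n} x_i^2 + x_{n+1}^2} \, (y_{n+1} - \beta^* x_{n+1})^2 \right\} \right] dr \\
&\propto \left[ 1 + \frac{1}{2 v_1} \, \frac{m + \sum_{i=1}^{n} x_i^2}{m + \sum_{i=1}^{n} x_i^2 + x_{n+1}^2} \, (y_{n+1} - \beta^* x_{n+1})^2 \right]^{-\frac{n + 2u + 1}{2}},
\end{aligned}
\tag{9.18}
\]

since 

\[
(\beta - \hat\beta)^2 \sum_{i=1}^{n} x_i^2 + (\beta - \beta_0)^2 m + (\beta x_{n+1} - y_{n+1})^2
= \left( m + \sum_{i=1}^{n} x_i^2 + x_{n+1}^2 \right) (\beta - \tilde\beta)^2
+ \frac{m + \sum_{i=1}^{n} x_i^2}{m + \sum_{i=1}^{n} x_i^2 + x_{n+1}^2} (y_{n+1} - \beta^* x_{n+1})^2
+ \frac{m \sum_{i=1}^{n} x_i^2}{m + \sum_{i=1}^{n} x_i^2} (\hat\beta - \beta_0)^2,
\]

where 

\[ \tilde\beta = \frac{\hat\beta \sum_{i=1}^{n} x_i^2 + m\beta_0 + x_{n+1} y_{n+1}}{m + \sum_{i=1}^{n} x_i^2 + x_{n+1}^2} \]

and 

\[ \beta^* = \frac{\hat\beta \sum_{i=1}^{n} x_i^2 + m\beta_0}{m + \sum_{i=1}^{n} x_i^2}, \]

as in (9.5). Thus, the predictive density of $y_{n+1}$ is a 3-parameter t-density with (n+2u) df, location parameter 
$\beta^* x_{n+1}$, and scale parameter 

\[ \frac{n + 2u}{2 v_1} \cdot \frac{m + \sum_{i=1}^{n} x_i^2}{m + \sum_{i=1}^{n} x_i^2 + x_{n+1}^2}. \]

In particular, if the prior density of $\beta$ and r is non-informative, that is, 
$g(\beta, r) \propto 1/r$, we obtain the predictive density, by letting $u \to -1/2$, $v \to 0$ and $m \to 0$, as a 3-parameter 
t-density with (n−1) df, location parameter $\hat\beta x_{n+1}$, and scale parameter 

\[ \frac{\sum_{i=1}^{n} x_i^2}{\hat\sigma^2 \left( \sum_{i=1}^{n} x_i^2 + x_{n+1}^2 \right)}. \]
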
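Heteroscedastic Model 


Consider the simple linear heteroscedastic model defined in Section 9.2. Suppose $y_{n+1}$ is a future 
unobserved value. Let us consider the prior distribution for $(\beta, r)$ as 

\[ g(\beta, r) \propto 1/r; \qquad \beta \in (-\infty, \infty), \; r > 0. \]

The posterior distribution of $(\beta, r)$, given z, is 

\[ g(\beta, r \mid z) \propto \left[ r^{1/2} \exp\left\{ -\frac{r}{2} \sum_{i=1}^{n} w_i x_i^2 \, (\beta - \hat\beta_w)^2 \right\} \right] \left[ r^{\frac{n-1}{2} - 1} \exp(-r v_1) \right], \]

where 

\[ \hat\beta_w = \frac{\sum_{i=1}^{n} w_i x_i y_i}{\sum_{i=1}^{n} w_i x_i^2} \qquad \text{and} \qquad v_1 = \frac{1}{2} \left[ \sum_{i=1}^{n} w_i y_i^2 - \frac{\left( \sum_{i=1}^{n} w_i x_i y_i \right)^2}{\sum_{i=1}^{n} w_i x_i^2} \right]. \]

Thus, the predictive distribution of $y_{n+1}$, given z and $x_{n+1}$, is 

\[
\begin{aligned}
g(y_{n+1} \mid x_{n+1}, z) &= \int_0^\infty \int_{-\infty}^{\infty} g(\beta, r \mid z) \, f(y_{n+1} \mid \beta, r, x_{n+1}) \, d\beta \, dr \\
&\propto \int_0^\infty r^{\frac{n}{2} - 1} \exp\left[ -r \left\{ v_1 + \frac{1}{2} \, \frac{w_{n+1} \sum_{i=1}^{n} w_i x_i^2}{w_{n+1} x_{n+1}^2 + \sum_{i=1}^{n} w_i x_i^2} \, (y_{n+1} - \hat\beta_w x_{n+1})^2 \right\} \right] dr \\
&\propto \left[ 1 + (n-1)^{-1} \, \frac{(n-1) \, w_{n+1} \sum_{i=1}^{n} w_i x_i^2}{2 v_1 \left( w_{n+1} x_{n+1}^2 + \sum_{i=1}^{n} w_i x_i^2 \right)} \, (y_{n+1} - \hat\beta_w x_{n+1})^2 \right]^{-n/2}, \qquad y_{n+1} \in (-\infty, \infty).
\end{aligned}
\tag{9.19}
\]

As before, the predictive density of $y_{n+1}$, under the heteroscedastic model, is a 3-parameter t-density 
with (n−1) df, location parameter $\hat\beta_w x_{n+1}$, and scale parameter 

\[ \frac{(n-1) \, w_{n+1} \sum_{i=1}^{n} w_i x_i^2}{2 v_1 \left( w_{n+1} x_{n+1}^2 + \sum_{i=1}^{n} w_i x_i^2 \right)}. \]

In particular, if $r_i = r / x_i^2$, so that $w_i = 1/x_i^2$, $i = 1, \ldots, n+1$, the scale parameter reduces to 

\[ \frac{n(n-1)}{(n+1) \, x_{n+1}^2} \left[ \sum_{i=1}^{n} \left( \frac{y_i}{x_i} \right)^2 - \frac{1}{n} \left( \sum_{i=1}^{n} \frac{y_i}{x_i} \right)^2 \right]^{-1} \]

and the location parameter to $\hat\beta_w x_{n+1}$ with $\hat\beta_w = \frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{x_i}$. 

However, if $w_i = 1/x_i$, the marginal posterior distribution of $\beta$ is a 3-parameter t-density with 
(n−1) df, location parameter $\hat\beta_w = \bar y / \bar x$, and scale parameter $(n-1) \left[ \frac{1}{n\bar x} \sum_{i=1}^{n} \frac{y_i^2}{x_i} - \left( \frac{\bar y}{\bar x} \right)^2 \right]^{-1}$. The 
100(1−α)% highest predictive credible interval for $y_{n+1}$ can be easily constructed by using the predictive 
density function of $y_{n+1}$ given $x_{n+1}$, z. 
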

9.4 ESTIMATION 


In a simple normal linear regression model, we have seen that the marginal posterior density of 
the regression coefficient is either a normal or a 3-parameter t-distribution according as the error 
variance (or precision) is known or unknown. Both the normal and Student's t-distributions are 
symmetric about the location parameter and are also unimodal, so the posterior mean, median and mode 
coincide. Thus, the posterior mean of the marginal posterior distribution of $\beta$ will be the Bayes 
estimate of $\beta$ under the squared error loss, the absolute error loss and also the zero-one loss functions. 


Homoscedastic disturbances 


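Under the assumptions of Case 1(b) of Section 9.1, the Bayes estimate $\hat\beta_L$ of $\beta$ under the linex 
loss function 

\[ L(\beta, \hat\beta_L) = e^{a(\hat\beta_L - \beta)} - a(\hat\beta_L - \beta) - 1, \qquad a \neq 0, \tag{9.20} \]

is given by 

\[
\hat\beta_L = -\frac{1}{a} \log E\left( e^{-a\beta} \mid z \right)
= \frac{\hat\beta \sum_{i=1}^{n} x_i^2 + m\beta_0}{\sum_{i=1}^{n} x_i^2 + m} - \frac{a}{2} \, \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2 + m}
= \left( m + \sum_{i=1}^{n} x_i^2 \right)^{-1} \left\{ \hat\beta \sum_{i=1}^{n} x_i^2 + m\beta_0 - \frac{a}{2} \sigma^2 \right\}.
\tag{9.21}
\]

In particular, if the prior distribution of $\beta$ is non-informative, we get $\hat\beta_L$ by making $m \to 0$, that is, 

\[ \hat\beta_L = \hat\beta - \frac{a \sigma^2}{2} \left( \sum_{i=1}^{n} x_i^2 \right)^{-1}. \tag{9.22} \]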
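
A minimal sketch of the linex estimate (9.21), under the known-variance conjugate setting of Case 1(b), is given below; the function name and the numerical values are illustrative assumptions.

```python
def linex_estimate(x, y, sigma2, beta0, m, a):
    """Bayes estimate of beta under linex loss, cf. (9.21).

    Uses the normal posterior N(beta_star, sigma2/(sxx + m)), for which
    -(1/a) log E(exp(-a*beta) | z) = beta_star - a*sigma2 / (2*(sxx + m)).
    """
    if a == 0:
        raise ValueError("the linex loss requires a != 0")
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return (sxy + m * beta0 - 0.5 * a * sigma2) / (sxx + m)


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
est_pos = linex_estimate(x, y, sigma2=1.0, beta0=0.0, m=2.0, a=1.0)
est_neg = linex_estimate(x, y, sigma2=1.0, beta0=0.0, m=2.0, a=-1.0)
print(est_pos, est_neg)
```

With a > 0 the estimate falls below the posterior mean and with a < 0 it lies above it, reflecting the asymmetry of the loss; as a tends to zero both tend to the posterior mean.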


In case $\sigma^2$ is also unknown, we cannot obtain the Bayes estimate under the linex loss function if we use 
as the prior distribution for $(\beta, \sigma)$ either the non-informative prior $g(\beta, \sigma) \propto 1/\sigma$ or the normal-gamma prior, 
because the marginal posterior density of $\beta$ is a t-density for which the moment generating function does 
not exist. The Bayes estimate $\hat\beta_L$ cannot be obtained unless we substitute some estimate of $\sigma^2$ in place 
of $\sigma^2$ (see Zellner, 1986). One possible estimate of $\sigma^2$ is $\hat\sigma^2 = \sum_{i=1}^{n} (y_i - \hat\beta x_i)^2 / (n-1)$, where $\hat\beta$ may be 
taken as the least squares estimate of $\beta$. 
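The Bayes estimate of the future observation under the squared error loss function, absolute error 
loss function and zero-one loss function will be the predictive mean, median or mode, respectively. Since 
the predictive distribution of the future observation $y_{n+1}$ is normal, the Bayes estimate of $y_{n+1}$ will be 
$\beta^* x_{n+1}$ in each case. If we use the linex loss function 

\[ L(y_{n+1}, \hat y_{n+1}) = e^{a(\hat y_{n+1} - y_{n+1})} - a(\hat y_{n+1} - y_{n+1}) - 1, \qquad a \neq 0, \]

the linex estimate of $y_{n+1}$ will be 

\[
\hat y_{n+1} = -\frac{1}{a} \log E\left( e^{-a y_{n+1}} \mid z \right)
= \beta^* x_{n+1} - \frac{a \sigma^2}{2} \, \frac{x_{n+1}^2 + \sum_{i=1}^{n} x_i^2 + m}{\sum_{i=1}^{n} x_i^2 + m}.
\tag{9.23}
\]
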


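Example 9.1. (Zellner's MELO Estimate, 1978) Suppose we wish to estimate 
$\theta = 1/\beta$. If we consider the squared error loss function, the posterior mean of $\theta$ as an estimate may 
not exist. Let us consider, following Zellner (1978), the relative squared error loss function 

\[ L(\theta, \hat\theta) = \left( \frac{\theta - \hat\theta}{\theta} \right)^2 = (1 - \beta\hat\theta)^2. \tag{9.24} \]

The posterior expected loss is minimised for 

\[ \hat\theta = \frac{1}{E(\beta \mid z)} \left[ 1 + \frac{\mathrm{Var}(\beta \mid z)}{\left( E(\beta \mid z) \right)^2} \right]^{-1}. \tag{9.25} \]

Under the assumptions of Case 1(b) of Section 9.1, the posterior distribution of $\beta$ is 
$N\left( \beta^*, \; \sigma^2 \left( \sum_{i=1}^{n} x_i^2 + m \right)^{-1} \right)$; then 

\[ \hat\theta = \frac{1}{\beta^*} \left[ 1 + \frac{\sigma^2}{\beta^{*2} \left( \sum_{i=1}^{n} x_i^2 + m \right)} \right]^{-1}, \tag{9.26} \]

and if the prior distribution of $\beta$ is non-informative then 

\[ \hat\theta = \frac{1}{\hat\beta} \left[ 1 + \frac{\sigma^2}{\hat\beta^2 \sum_{i=1}^{n} x_i^2} \right]^{-1}. \]
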


Zellner (1978) calls it the Minimum Expected Loss (MELO) estimate, which is found to be the product 
of $1/\hat\beta$ (the mle of $1/\beta$) and a shrinking factor that has a value between 0 and 1. Zellner (1978) shows 
that the MELO estimator has finite moments and bounded risk relative to the squared error loss function (SELF), 
whereas the mle of $\theta$, $\hat\theta = 1/\hat\beta$, does not possess finite moments and has infinite risk relative to SELF. 
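
The MELO estimate (9.25) depends on the posterior of $\beta$ only through its mean and variance, which the following sketch makes explicit. The function name and the numerical posterior summaries are assumptions chosen for illustration.

```python
def melo_inverse(post_mean, post_var):
    """MELO estimate of theta = 1/beta under relative squared error
    loss, cf. (9.25): (1/E) * [1 + Var/E^2]^(-1) = E / (Var + E^2)."""
    shrink = 1.0 / (1.0 + post_var / post_mean ** 2)  # factor in (0, 1]
    return shrink / post_mean, shrink


# Hypothetical posterior summaries of beta (mean 59.7/32, variance 1/32)
theta_hat, shrink = melo_inverse(59.7 / 32.0, 1.0 / 32.0)
print(theta_hat, shrink)
```

The shrinking factor approaches one as the posterior variance becomes small relative to the squared posterior mean, in which case the estimate is close to the reciprocal of the posterior mean.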


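For the heteroscedastic model, the MELO estimate of $1/\beta$ is 

\[ \hat\theta = \left[ \frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{x_i} \right]^{-1} \left[ 1 + \frac{n \sigma^2}{\left( \sum_{i=1}^{n} y_i / x_i \right)^2} \right]^{-1}, \tag{9.27} \]

for $w_i = 1/x_i^2$. 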


Heteroscedastic Disturbances 


Consider the simple linear heteroscedastic model (9.12) defined in Section 9.2. Suppose the joint prior 
distribution of $(\beta, r)$ is the non-informative $g(\beta, r) \propto 1/r$; then the marginal posterior distribution of $\beta$ is 
a 3-parameter t-density with (n−1) df, location parameter $\hat\beta_w = \bar y / \bar x$, and scale parameter 

\[ (n-1) \left[ \frac{1}{n\bar x} \sum_{i=1}^{n} \frac{y_i^2}{x_i} - \left( \frac{\bar y}{\bar x} \right)^2 \right]^{-1}, \]

where $w_i = 1/x_i$. The Bayes estimate of $\beta$ is $\bar y / \bar x$ when the loss function is SELF. 
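However, if we assume r to be known so that $r_i = r w_i$, then the posterior distribution of the 
regression coefficient $\beta$ is $N\left( \beta^*, \; \tau + r \sum_{i=1}^{n} w_i x_i^2 \right)$, the second argument being the precision, with respect to the prior $N(\beta_0, \tau)$ for $\beta$, where 

\[ \beta^* = \left[ \hat\beta_w \, r \sum_{i=1}^{n} w_i x_i^2 + \tau\beta_0 \right] \left[ \tau + r \sum_{i=1}^{n} w_i x_i^2 \right]^{-1}. \]

The Bayes estimate $\hat\beta_L$ of $\beta$ under linex loss is 

\[ \hat\beta_L = \frac{\hat\beta_w \, r \sum_{i=1}^{n} w_i x_i^2 + \tau\beta_0}{\tau + r \sum_{i=1}^{n} w_i x_i^2} - \frac{a}{2} \left( \tau + r \sum_{i=1}^{n} w_i x_i^2 \right)^{-1}, \qquad a \neq 0, \tag{9.28} \]

so that, for $w_i = 1/x_i$, 

\[ \hat\beta_L = \frac{n r \bar y + \tau\beta_0}{\tau + n r \bar x} - \frac{a}{2} (\tau + n r \bar x)^{-1}, \qquad \text{which tends to} \quad \frac{\bar y}{\bar x} - \frac{a}{2 n r \bar x} \quad \text{as } \tau \to 0, \]

and for $w_i = 1/x_i^2$, 

\[ \hat\beta_L = \frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{x_i} - \frac{a}{2 n r} \quad \text{as } \tau \to 0. \]

In particular, when the parameter a of the linex loss tends 
to zero, $\hat\beta_L \to \bar y / \bar x$ or $\frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{x_i}$ according as $w_i = 1/x_i$ or $w_i = 1/x_i^2$. The result is not surprising 
since the linex loss function is approximately the squared error loss function for arbitrarily small values of 
$|a|$. 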
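If we use Zellner's (1994) balanced loss function 

\[ L(\tilde\beta, \beta) = \omega \sum_{i=1}^{n} (y_i - \tilde\beta x_i)^2 + (1 - \omega) \sum_{i=1}^{n} x_i^2 (\tilde\beta - \beta)^2, \qquad 0 \le \omega \le 1, \tag{9.29} \]

then the Bayes estimate $\tilde\beta$ is given by 

\[ \tilde\beta = \omega \hat\beta_w + (1 - \omega) E(\beta \mid z), \tag{9.30} \]

where $\hat\beta_w = \bar y / \bar x$ and $E(\beta \mid z) = \dfrac{\tau\beta_0 + n r \bar y}{\tau + n r \bar x}$, for $w_i = 1/x_i$. 

In particular, if we take the vague prior, that is, $\tau \to 0$, 

\[ \tilde\beta = \hat\beta_w = \bar y / \bar x. \]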
So, we observe that the Bayes estimate under the balanced loss function reduces to that under SELF 
and also to the OLSE. 

Example 9.2. (Error variance proportional to square of the regression function). 
Prais and Houthakker (1955) considered a heteroscedastic model in which $\sigma_i^2$, the error variance of 
the ith observation, is proportional to the square of the regression function, that is, 
$\sigma_i^2 = \sigma^2 (\beta_0 + \beta_1 x_i)^2$. This regression model is applicable when the expenditure depends on the 
household expenditure of various consumption goods. 

We would like to obtain the Bayes estimate of $\beta$ for a simple regression model with no intercept term 

\[ y_i = \beta x_i + u_i, \qquad i = 1, 2, \ldots, n, \]

where the $u_i$'s are independently distributed as $N(0, \sigma_i^2)$ with $\sigma_i^2 = \sigma^2 \beta^2 x_i^2$. If $\sigma^2$ is assumed to be known, and taken to 
be unity, then $y_i \sim N(\beta x_i, \beta^2 x_i^2)$, for $i = 1, 2, \ldots, n$. The likelihood function of $\beta$, given z, is 

\[ \ell(\beta \mid z) \propto |\beta|^{-n} \exp\left[ -\frac{1}{2} \sum_{i=1}^{n} \left( \frac{y_i}{\beta x_i} - 1 \right)^2 \right]. \]

Let us denote $t_i = y_i / x_i$, so that 

\[ \ell(\beta \mid z) \propto |\beta|^{-n} \exp\left[ -\frac{1}{2\beta^2} \sum_{i=1}^{n} (t_i - \beta)^2 \right]. \]

The conjugate prior for $\beta$ is a generalized inverse normal distribution 
(see Robert, 1991), 

\[ g(\beta) \propto |\beta|^{-\alpha} \exp\left[ -\frac{1}{2\tau^2} \left( \frac{1}{\beta} - \mu \right)^2 \right], \]

1 
ae eae 
with normalising constant k7! = tte" ?* 2 2 r( 


2 
«J] 
1/2 
n 4 


The posterior distribution of B, given Z; 1S 
(B|z) = [B[™ exp] -—| —- 
= "| 20 (B 
which is also a generalized inverse normal distribution with hyperparameters (O,,[,,7,), where 
Lt 
1 


(9.31) 


1 


a-1 1 we 
27 


a-1 
Jal 9? my 


2 2 
(9.32) 


1 
-l 




2 Hr +(, -1)(o, -B)t 
Hy Ly +O, (Q, Z 3)%; ‘ 
on using the approximation given in Robert (2001, page 495) 


F (azb:2) = £0) eZ p fae) 
ee Ee Ne zi2 | 


In particular, for non-informative prior g(B)«<|B|', we obtain approximate posterior mean of B by 


making 1 > 0, t 0 and a 1. We have 


2 
2)" t B | +n(n—3)o" )” te 
F i=l i=l 


E(B|z)=—= : 5 - (9.33) 
Yt » “| +(nt)(n-20°) v 
a i=l i=l 
If we use the relative squared error loss function 

\[ L(\beta, \hat\beta) = (1 - \hat\beta / \beta)^2, \tag{9.34} \]

then the Bayes estimate of $\beta$ is 

\[ \hat\beta = E(\beta^{-1} \mid z) \big/ E(\beta^{-2} \mid z). \tag{9.35} \]
Since, 


22 n+2=1 = _ 272 
= mt ps Pavan bce! F n+2 Ldn ps? 
s 2 2 *2' 4 


and 


n+2 25-2 n+2 252 
- t n+2 n+2 1 nt 
E(B2|z)=k"(s?) 2 exp} ——— 22 Tr F = (9.37) 
(B°z)=""(e) | aie i | 
the Bayes estimate of B, under relative squared error loss function, is 
242 
r n+1 F Bel, 1 oon u 
6 = Ss Q 2 2s 


9” 
2 ,{n+2 n+2 1 nt? 
r (Ele oe 
2 2 2 2s 
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where t, = y,/x,,1=1,2,...,n. 


Binomial Regression Model 


Example 9.3. (Bird nesting problem, Aitchison and Dunsmore, 1975) Four pairs of a rare species of 
bird nested for the first time in Scotland last season. The observed numbers of eggs in the four nests 
were 2, 3, 3, 4 and from these nests 1, 2, 3, 3 nestlings survived the season. At the start of the current 
season, a new pair has a nest with three eggs. We are interested in finding the probability that at least 
two nestlings will survive the season. 

Solution. This is a problem of binomial regression with $\theta$ as the probability that an egg from a nest 
will give rise to a surviving nestling. Here $x = (2, 3, 3, 4)$, $y = (1, 2, 3, 3)$ and $x_5 = 3$. Consider a 
Beta(a, b) prior for $\theta$. Since each $y_i \sim \mathrm{Bin}(x_i, \theta)$, the posterior distribution of $\theta$, given z, is 

\[
g(\theta \mid z) = \frac{\theta^{a-1} (1-\theta)^{b-1} \prod_{i=1}^{n} \binom{x_i}{y_i} \theta^{y_i} (1-\theta)^{x_i - y_i}}
{\int_0^1 \theta^{a-1} (1-\theta)^{b-1} \prod_{i=1}^{n} \binom{x_i}{y_i} \theta^{y_i} (1-\theta)^{x_i - y_i} \, d\theta},
\]

which is a Beta$(a_1, b_1)$, where $a_1 = a + \sum_{i=1}^{n} y_i$ and $b_1 = b + \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} y_i$, and the predictive 
distribution of the future observation $y_5$, given $x_5$ and z, is 

\[
g(y_5 \mid x_5, z) = \int_0^1 f(y_5 \mid x_5, \theta) \, g(\theta \mid z) \, d\theta
= \binom{x_5}{y_5} \frac{B(a_1 + y_5, \; b_1 + x_5 - y_5)}{B(a_1, b_1)}, \qquad y_5 = 0, 1, \ldots, x_5,
\tag{9.38}
\]

which is a beta-binomial distribution. 

Let us assume that the information about the parameter $\theta$ is vague and we represent it by taking 
a = b = 0, that is, we are considering Haldane's nil-prior for $\theta$. Thus, the predictive pmf for the number of surviving 
nestlings $y_5$ in the current season, when 9 nestlings survived out of 12 eggs in the last season and 3 
eggs are observed in the nest this year, is 

\[ g(y_5 \mid x_5 = 3, z) = \binom{3}{y_5} \frac{B(9 + y_5, \; 6 - y_5)}{B(9, 3)}, \qquad y_5 = 0, 1, 2, 3. \]

Therefore, the probability that at least two nestlings survive out of 3 eggs this season is 

\[ \binom{3}{2} \frac{B(11, 4)}{B(9, 3)} + \binom{3}{3} \frac{B(12, 3)}{B(9, 3)} = \frac{135}{364} + \frac{165}{364} = \frac{75}{91} \approx 0.824. \]



Remark 9.11. The classical statistician will first estimate the unknown probability $\theta$ of success using the 
maximum likelihood estimate $\hat\theta = \sum_{i=1}^{n} y_i \big/ \sum_{i=1}^{n} x_i = 9/12 = 0.75$, and will then "plug in" this value of $\theta$ to 
obtain the probability of at least two nestlings surviving this season as 

\[ \binom{3}{2} \hat\theta^2 (1 - \hat\theta) + \binom{3}{3} \hat\theta^3 (1 - \hat\theta)^0 = 0.422 + 0.422 = 0.844. \]

The difference between the Bayesian and the classical answer is that in the classical answer the mle 
$\hat\theta$ was treated as the true value of the parameter, whereas the Bayesian approach averages the 
probability mass function for the current season with weighting factor $g(\theta \mid z)$ = Beta(9, 3), which has mean 
value 0.75. 
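
The two answers of Example 9.3 and Remark 9.11 can be reproduced with a few lines of code. The sketch below uses only the data given in the example; the helper names are assumptions made for illustration.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k, n, a, b):
    """Beta-binomial predictive pmf of (9.38) with posterior Beta(a, b)."""
    return comb(n, k) * exp(log_beta(a + k, b + n - k) - log_beta(a, b))


eggs = [2, 3, 3, 4]
survived = [1, 2, 3, 3]
x_new = 3

# Haldane prior a = b = 0 gives the posterior Beta(9, 3)
a1 = 0 + sum(survived)
b1 = 0 + sum(eggs) - sum(survived)
p_bayes = sum(beta_binomial_pmf(k, x_new, a1, b1) for k in (2, 3))

# Classical plug-in answer using the mle of theta
theta_hat = sum(survived) / sum(eggs)
p_plugin = sum(comb(x_new, k) * theta_hat ** k * (1 - theta_hat) ** (x_new - k)
               for k in (2, 3))

print(round(p_bayes, 3), round(p_plugin, 3))
```

The predictive probability is smaller than the plug-in value because it averages over the posterior uncertainty in $\theta$ rather than fixing $\theta$ at its estimate.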

Remark 9.12. The classical approach of “plugging-in” the estimated value of the parameter in the 
distribution and then using it for future purposes may be criticized on the grounds that it fails to take 
into account the uncertainty of the unknown parameter. However, Bayesian approach incorporates such 
an uncertainty in a natural way. 

Remark 9.13. The plug-in approach in prediction problems may be incoherent. 


Example 9.4. (Lindley, 1999) Let $X_1, X_2, \ldots, X_n$ be n iid random variables from 
$f(x \mid \theta) = \theta e^{-\theta x}$; $x > 0$, $\theta > 0$, and let $X_{n+1}$ denote some future independent observation from the same 
density. Then 

\[ P(X_{n+1} > x_{n+1} \mid \theta) = \int_{x_{n+1}}^{\infty} \theta e^{-\theta x} \, dx = \exp(-\theta x_{n+1}). \tag{9.39} \]

It can be shown that the mle of $\theta$ is $\hat\theta_n = n/s$, where $s = \sum_{i=1}^{n} x_i$. Therefore, the mle of 
$\exp(-\theta x_{n+1})$ is $\exp(-\hat\theta_n x_{n+1}) = \exp(-n x_{n+1} / s)$. If we 'plug in' $\hat\theta_n$ in $f(x \mid \theta)$ for $\theta$, the classical 
statistician would use the pdf 

\[ g(x_{n+1} \mid x) = \frac{n}{s} \exp(-n x_{n+1} / s) \tag{9.40} \]

as the distribution for future purposes. From the Bayesian perspective, this will be coherent only if 
there exists a prior distribution $g(\theta)$ such that 

\[
g(x_{n+1} \mid x) = \int_0^\infty f(x_{n+1} \mid x, \theta) \, g(\theta \mid x) \, d\theta
= \frac{\int_0^\infty f(x_{n+1} \mid \theta) \, f(x \mid \theta) \, g(\theta) \, d\theta}{\int_0^\infty f(x \mid \theta) \, g(\theta) \, d\theta},
\tag{9.41}
\]

because $x_{n+1}$ and x are conditionally independent for a given $\theta$. The conditional 
distributions (9.40) and (9.41) imply that 

\[ \frac{n}{s} \exp(-n x_{n+1} / s) = \frac{\int_0^\infty \theta^{n+1} e^{-\theta (s + x_{n+1})} g(\theta) \, d\theta}{\int_0^\infty \theta^{n} e^{-\theta s} g(\theta) \, d\theta} \]

or 

\[ \int_0^\infty \theta^{n} e^{-\theta s} \left[ \frac{n}{s} \exp(-n x_{n+1} / s) - \theta \exp(-\theta x_{n+1}) \right] g(\theta) \, d\theta = 0. \tag{9.42} \]

This must hold for all $x_{n+1}$, n, and s. Let us put $x_{n+1} = s/n$ in (9.42). The expression inside the 
brackets, $\frac{n}{s} e^{-1} - \theta e^{-\theta s/n}$, has a minimum in $\theta$ at $\theta = n/s$, where it is zero. 

Thus, the integrand in (9.42) is non-negative for all n and s. Consequently, the integral cannot 
be zero for any non-negative function $g(\theta)$ and so no prior distribution $g(\theta)$ exists. Therefore, the 
method of "plugging-in" the mle of $\theta$ is incoherent and a 'Dutch Book' can be made against the 
frequentist. 


9.5 FINITE POPULATION PROBLEM 


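Example 9.5. Poisson Regression (Bolfarine and Zacks, 1991) Consider a population of N orchards, where 
the ith orchard extends over $x_i$ acres. The values of $x_1, x_2, \ldots, x_N$ are known. Let $y_i$, $i = 1, 2, \ldots, N$, denote 
the number of trees in the ith orchard which are infested with a certain fungus. We are interested in 
predicting the total number T of infested trees in the population. 

Let us consider $y_1, y_2, \ldots, y_N$ to be independently distributed such that $y_i \sim \mathrm{Pois}(\lambda x_i)$, $i = 1, \ldots, N$. 
The parameter $\lambda$ represents the average number of infested trees per acre. Let $s = \{1, 2, \ldots, n\}$ be a sample 
of n orchards having units with the largest values of the $x_i$'s. Suppose the prior distribution of $\lambda$ is 
Gamma(u, v). Denote $y_s = (y_1, \ldots, y_n)$, $x_s = (x_1, \ldots, x_n)$, $y_r = (y_{n+1}, \ldots, y_N)$ and $x_r = (x_{n+1}, \ldots, x_N)$. The posterior 
distribution of $\lambda$, given $y_s$ and $x_s$, is Gamma$(u + n\bar y_s, \; v + n\bar x_s)$ and the predictive distribution of $y_r$ is 

\[
\begin{aligned}
g(y_r \mid y_s) &= \int_0^\infty g(\lambda \mid y_s, x_s) \, g(y_r \mid \lambda) \, d\lambda \\
&= \int_0^\infty \prod_{i=n+1}^{N} \left[ \frac{e^{-\lambda x_i} (\lambda x_i)^{y_i}}{y_i!} \right] \frac{(v + n\bar x_s)^{u + n\bar y_s}}{\Gamma(u + n\bar y_s)} \, \lambda^{u + n\bar y_s - 1} e^{-\lambda (v + n\bar x_s)} \, d\lambda \\
&= \prod_{i=n+1}^{N} \left( \frac{x_i^{y_i}}{y_i!} \right) \frac{(v + n\bar x_s)^{u + n\bar y_s}}{\Gamma(u + n\bar y_s)} \cdot \frac{\Gamma\left( u + n\bar y_s + (N-n)\bar y_r \right)}{\left( v + n\bar x_s + (N-n)\bar x_r \right)^{u + n\bar y_s + (N-n)\bar y_r}} \\
&= \left[ \frac{T_r!}{\prod_{i=n+1}^{N} y_i!} \prod_{i=n+1}^{N} \pi_i^{y_i} \right] \left[ \frac{\Gamma(u + n\bar y_s + T_r)}{\Gamma(u + n\bar y_s) \, T_r!} \, (1 - \psi)^{u + n\bar y_s} \, \psi^{T_r} \right],
\end{aligned}
\tag{9.43}
\]

where 

\[ \psi = \frac{(N-n)\bar x_r}{v + N\bar x}, \qquad \pi_i = \frac{x_i}{(N-n)\bar x_r}, \qquad T_r = \sum_{i=n+1}^{N} y_i, \qquad \text{and} \quad N\bar x = n\bar x_s + (N-n)\bar x_r. \]
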
ale Pe > J ee 


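Writing $g(y_r \mid y_s) = g(y_r \mid T_r, y_s) \, g(T_r \mid y_s)$, we find that $g(T_r \mid y_s)$ is NBin$(u + n\bar y_s, \; 1 - \psi)$ and $g(y_r \mid T_r, y_s)$ is a 
multinomial distribution with parameters $T_r$ and $(\pi_{n+1}, \pi_{n+2}, \ldots, \pi_N)$. 

Since the mean of NBin$(u + n\bar y_s, \; 1 - \psi)$ is $(u + n\bar y_s)(N-n)\bar x_r / (v + n\bar x_s)$, we have 

\[ \hat T = \sum_{i=1}^{n} y_i + E(T_r \mid y_s) = \sum_{i=1}^{n} y_i + (u + n\bar y_s) \, \frac{(N-n)\bar x_r}{v + n\bar x_s}. \tag{9.44} \]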
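
The predictor (9.44) needs only the sample totals and the known acreages of the unsampled orchards. The sketch below evaluates it for made-up numbers; the function name and the data are hypothetical and serve only to illustrate the formula.

```python
def poisson_total_predictor(y_s, x_s, x_r, u, v):
    """Bayes predictor of the population total T, cf. (9.44).

    sum(y_s) = n * ybar_s, sum(x_s) = n * xbar_s and
    sum(x_r) = (N - n) * xbar_r in the notation of the example.
    """
    return sum(y_s) + (u + sum(y_s)) * sum(x_r) / (v + sum(x_s))


# Hypothetical sample of n = 3 orchards and N - n = 4 unsampled ones
y_s = [12, 9, 7]               # infested trees in sampled orchards
x_s = [10.0, 8.0, 6.0]         # acres of sampled orchards
x_r = [5.0, 4.0, 3.0, 2.0]     # acres of unsampled orchards

t_informative = poisson_total_predictor(y_s, x_s, x_r, u=2.0, v=1.0)
t_vague = poisson_total_predictor(y_s, x_s, x_r, u=0.0, v=0.0)
print(t_informative, t_vague)
```

With u = v = 0 the predictor equals the sample ratio of infested trees to acres multiplied by the total acreage, which is the ratio-estimator form.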


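Remark 9.14. The non-informative prior distribution $g(\lambda) \propto 1/\lambda$ may be obtained by making 
$u \to 0$ and $v \to 0$. The Bayes predictor (9.44) of the population total T reduces to 

\[ \hat T = \sum_{i=1}^{n} y_i + \frac{\bar y_s}{\bar x_s} (N-n)\bar x_r, \]

which is the classical ratio estimator of T. 

Remark 9.15. For Varian's linex loss 

\[ L(\hat T_r, T_r) = e^{a(\hat T_r - T_r)} - a(\hat T_r - T_r) - 1, \qquad a \neq 0, \]

the Bayes predictor is 

\[ \hat T_r = -\frac{1}{a} \log M_{T_r}(-a), \]

where $M_{T_r}(-a)$, the moment generating function of the predictive distribution of $T_r$ evaluated at $-a$, is 

\[ \left( \frac{v + n\bar x_s}{v + N\bar x - (N-n)\bar x_r e^{-a}} \right)^{u + n\bar y_s}, \]

so that we have 

\[ \hat T_r = -\frac{1}{a} (u + n\bar y_s) \log\left[ \frac{v + n\bar x_s}{v + N\bar x - (N-n)\bar x_r e^{-a}} \right], \qquad a \neq 0. \tag{9.45} \]
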
a 


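Example 9.6. Normal Regression 
Consider the simple linear regression model 

\[ y_i = \beta x_i + u_i, \qquad i = 1, 2, \ldots, N, \]

where $u_i$ is $N(0, \sigma^2 w_i)$, $i = 1, 2, \ldots, N$; $\sigma^2$ is known and $E(u_i u_j) = 0$ for all $i \neq j = 1, 2, \ldots, N$. 

We are interested in estimating the population total $T = \sum_{i=1}^{N} y_i$ based on a random sample 
$\{(x_i, y_i) : i = 1, \ldots, n\}$ drawn from a finite population. Let us denote, as before, $y_s = (y_1, \ldots, y_n)$ and 
$y_r = (y_{n+1}, \ldots, y_N)$. Then the Bayes estimate of T will be $\sum_{i=1}^{n} y_i + E\left( \sum_{i=n+1}^{N} y_i \mid y_s \right)$. Since each 
$y_i \sim N(\beta x_i, \sigma^2 w_i)$ is distributed independently of the others, 

\[ \sum_{i=n+1}^{N} y_i \sim N\left( \beta \sum_{i=n+1}^{N} x_i, \; \sigma^2 \sum_{i=n+1}^{N} w_i \right). \tag{9.46} \]

Denote $A_r = \sum_{i=n+1}^{N} x_i$, $T_r = \sum_{i=n+1}^{N} y_i$, $A_s = \sum_{i=1}^{n} x_i$ and $T_s = \sum_{i=1}^{n} y_i$. 
Under the assumption of a $N(\beta_0, \sigma^2/m)$ prior for $\beta$, with m > 0, the posterior distribution of $\beta$ is 

\[ N\left( \beta^*, \; \sigma^2 \left( \frac{A_s^2}{\sum_{i=1}^{n} w_i} + m \right)^{-1} \right), \qquad \text{where} \quad \beta^* = \left( \frac{A_s T_s}{\sum_{i=1}^{n} w_i} + m\beta_0 \right) \Big/ \left( \frac{A_s^2}{\sum_{i=1}^{n} w_i} + m \right). \]

Therefore, the predictive distribution of $T_r$, given $y_s$ and $x = (x_1, x_2, \ldots, x_N)$, is 

\[ N\left( \beta^* A_r, \; \sigma^2 \left\{ A_r^2 \left( \frac{A_s^2}{\sum_{i=1}^{n} w_i} + m \right)^{-1} + \sum_{i=n+1}^{N} w_i \right\} \right). \]

Hence, the Bayes estimate of T, under the squared error loss function, is 

\[ \hat T = \sum_{i=1}^{n} y_i + \beta^* \sum_{i=n+1}^{N} x_i. \tag{9.47} \]

Remark 9.16. In particular, for $w_i = x_i$, 

\[ \hat T = \sum_{i=1}^{n} y_i + \frac{m\beta_0 + T_s}{m + A_s} \sum_{i=n+1}^{N} x_i. \tag{9.48} \]

Further, for the non-informative prior $g(\beta) \propto 1$, 

\[ \hat T = \sum_{i=1}^{n} y_i + \frac{\bar y_s}{\bar x_s} \sum_{i=n+1}^{N} x_i, \tag{9.49} \]

which is the classical ratio estimator of the population total. 

9.6 HYPOTHESES TESTING 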


We shall follow the theory developed in Chapter 7 to obtain Bayes factor for comparing 
hypotheses concerning B. Let us consider the simple normal linear regression model with known 
variance. 


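Example 9.7. Consider $H_0 : \beta = \beta_0$ against $H_1 : \beta = \beta_1$. Suppose $\pi_0$ is the prior probability of the null 
hypothesis. The likelihood function of $\beta$ is 

\[ \ell(\beta \mid z) = \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^{n} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta x_i)^2 \right]. \]

The posterior distribution of $\beta$, given z, is 

\[
g(\beta \mid z) =
\begin{cases}
\pi_0 \, \ell(\beta_0 \mid z) / m(z), & \text{if } \beta = \beta_0, \\
(1 - \pi_0) \, \ell(\beta_1 \mid z) / m(z), & \text{if } \beta = \beta_1.
\end{cases}
\]

Therefore, the Bayes factor in favour of $H_0$ is 

\[ B_{01} = \frac{\ell(\beta_0 \mid z)}{\ell(\beta_1 \mid z)}. \]

We shall prefer $H_0$ over $H_1$ if $B_{01} > 1$. After simplification, we find that, taking $\beta_0 > \beta_1$, $H_0$ should be preferred if 

\[ \hat\beta > \frac{1}{2} (\beta_0 + \beta_1), \]

where 

\[ \hat\beta = \sum_{i=1}^{n} x_i y_i \Big/ \sum_{i=1}^{n} x_i^2. \]
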


Example 9.8. In order to test the sharp null hypothesis $H_0 : \beta = 0$ against $H_1 : \beta \neq 0$, we may use 
Lindley's approach in which we construct the 100(1−α)% HPD credible interval for $\beta$ and accept $H_0$ if 
the hypothetical value of $\beta$ under $H_0$ lies in it. In particular, if our prior distribution for $\beta$ is non- 
informative then the 95% HPD credible interval for $\beta$ is 

\[ \left[ \hat\beta - 1.96\,\sigma \left( \sum_{i=1}^{n} x_i^2 \right)^{-1/2}, \;\; \hat\beta + 1.96\,\sigma \left( \sum_{i=1}^{n} x_i^2 \right)^{-1/2} \right]. \]

If this interval contains zero then we shall not reject $H_0$ at the 5% level of significance. 
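Example 9.9. Let us follow Jeffreys' approach to test $H_0 : \beta = \beta_0$ against $H_1 : \beta \neq \beta_0$. We shall 
follow the approach discussed in Example 7.12. The Bayes factor in favour of $H_0$ is 

\[ B_{01} = \frac{f(z \mid \beta_0)}{m_1(z)}. \]

Here 

\[
f(z \mid \beta_0) = \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^{n} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 x_i)^2 \right]
= \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^{n} \exp\left[ -\frac{1}{2\sigma^2} \left( \sum y_i^2 - \hat\beta^2 \sum x_i^2 \right) \right] \exp\left[ -\frac{1}{2\sigma^2} \sum x_i^2 \, (\beta_0 - \hat\beta)^2 \right]
\]

and 

\[ m_1(z) = \int f(z \mid \beta) \, g_1(\beta) \, d\beta, \qquad \text{where } g_1(\beta) \text{ is } N(\beta_0, \sigma_0^2). \]

After some algebraic manipulations, we find 

\[
m_1(z) = \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^{n} \left( 1 + \frac{\sigma_0^2 \sum x_i^2}{\sigma^2} \right)^{-1/2}
\exp\left[ -\frac{1}{2\sigma^2} \left( \sum y_i^2 - \hat\beta^2 \sum x_i^2 \right) \right]
\exp\left[ -\frac{1}{2} \, \frac{(\beta_0 - \hat\beta)^2}{\sigma^2 / \sum x_i^2 + \sigma_0^2} \right],
\]

so that 

\[
B_{01} = \left( 1 + \frac{\sigma_0^2 \sum x_i^2}{\sigma^2} \right)^{1/2}
\exp\left[ -\frac{1}{2} \, \frac{\sum x_i^2}{\sigma^2} \, (\beta_0 - \hat\beta)^2 \left( 1 + \frac{\sigma^2}{\sigma_0^2 \sum x_i^2} \right)^{-1} \right].
\]
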


Remark 9.17. If we consider $x_i = 1$; $i = 1, 2, \ldots, n$, the problem reduces to Example 7.12. 
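
The Bayes factor of Example 9.9 can be evaluated directly once the data, $\sigma^2$, $\beta_0$ and the prior variance $\sigma_0^2$ under $H_1$ are fixed. The sketch below assumes the closed form $B_{01} = (1 + k)^{1/2} \exp[-\frac{1}{2}(\sum x_i^2/\sigma^2)(\beta_0 - \hat\beta)^2 \, k/(1+k)]$ with $k = \sigma_0^2 \sum x_i^2 / \sigma^2$; the function name and the numbers are illustrative assumptions.

```python
import math


def bayes_factor_01(x, y, sigma2, beta0, sigma0_sq):
    """Bayes factor in favour of H0: beta = beta0 against H1: beta != beta0
    when beta ~ N(beta0, sigma0_sq) under H1 and sigma2 is known."""
    sxx = sum(xi * xi for xi in x)
    beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    k = sigma0_sq * sxx / sigma2          # prior-to-sampling variance ratio
    exponent = -0.5 * (sxx / sigma2) * (beta0 - beta_hat) ** 2 * k / (1.0 + k)
    return math.sqrt(1.0 + k) * math.exp(exponent)


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
b01 = bayes_factor_01(x, y, sigma2=1.0, beta0=2.0, sigma0_sq=1.0)
print(b01)
```

A value of the factor above one favours the null hypothesis; when the hypothesised value is far from the least squares estimate, the exponential term drives the factor towards zero.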
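Example 9.10. Suppose we wish to test the hypothesis $H_0 : \beta \le 0$ against $H_1 : \beta > 0$. Let the prior 
distribution of $\beta$ be $g(\beta) \propto 1$, $\beta \in (-\infty, \infty)$. The posterior distribution of $\beta$, given z, is found to be 
$N\left( \hat\beta, \; \sigma^2 \left( \sum_{i=1}^{n} x_i^2 \right)^{-1} \right)$. Under the 0-1 loss function, the Bayes rule for rejecting $H_0$ against $H_1$ is 
$P(H_0 \mid z) < 1/2$. Thus, we shall reject $H_0$ if 

\[ \int_{-\infty}^{0} g(\beta \mid z) \, d\beta < \frac{1}{2}. \]

Thus, the critical region of the Bayes test is 

\[ C = \left\{ z : \Phi\left( -\hat\beta \left[ \sigma^2 \left( \sum_{i=1}^{n} x_i^2 \right)^{-1} \right]^{-1/2} \right) < \frac{1}{2} \right\}, \]

that is, $H_0$ is rejected whenever $\hat\beta > 0$. 
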


Remark 9.18. The results obtained in the examples of this section may be easily generalised to the 
unknown variance case. 


9.7 SIMPLE CONTROL PROBLEM 


In a problem of control we aim to achieve a specified y value, say $y = y^*$, and need to find a corresponding 
x value which is optimum in some defined sense. The Bayesian approach provides a solution to the control 
problem that is operational, reflects uncertainty about the parameter values, and allows sequential learning. 

Let us assume that the model under consideration is a simple regression model with no intercept 
term 


y =Bx+u 
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Consider a sample of size n from the above model. Suppose y,,, is the unobserved value of the 


dependent variable y corresponding to x_, andu,u,,.., u.. are iid N(0,o”) random errors with 0? 
n+ 1 2 n- 


1 +1 
known. Let us further assume that neither y,,, nor x, are known. We wish to fix y,,, near a given 


target value y’ and consider the loss associated of being off-target as 


Liye Y Ona 7 2 (9.50) 
Note that the loss function is random since y,,, is random. In order to tackle the problem of 
minimization of a random function, we minimise the predictive expected loss with respect to the choice 


of x_... Since 
n+l 


2 
E(L(y.4.y'))= B(O.. -v¥y) 


2 
= Var(veu ly+(¥"-BO sal] 


=o = +(y —Bx.i) (9.51) 


since g(y,,,|y) is N| Bx,.,, 0 5 


The predictive expected loss (9.51) is minimized for $x_{n+1} = x^*$, that is,

$$x^* = \frac{y^*}{\beta^*}\left[1 + \frac{\sigma^2}{\beta^{*2}\left(\sum_{i=1}^{n} x_i^2 + m\right)}\right]^{-1}. \qquad (9.52)$$

From (9.5), we note that $\beta^*$ and $\sigma^2 \big/ \left(m + \sum_{i=1}^{n} x_i^2\right)$ are the posterior mean and variance, respectively, of $\beta$. If the posterior variance is small, then the second factor of $x^*$ tends to unity and $x^* \approx y^*/\beta^*$. If the prior for $\beta$ is non-informative, then $x^*$ reduces to

$$x^* = \frac{y^*}{\hat\beta}\left[1 + \frac{\sigma^2}{\hat\beta^2 \sum_{i=1}^{n} x_i^2}\right]^{-1}.$$

Remark 9.19. In case of unknown $\sigma^2$, the predictive density of the future observation $y_{n+1}$ is a 3-parameter t-density and we may conjecture

$$x^* = \frac{y^*}{E(\beta \mid y)}\left[1 + \frac{\mathrm{Var}(\beta \mid y)}{E^2(\beta \mid y)}\right]^{-1}. \qquad (9.53)$$

Note the similarity with MELO estimate (9.25).
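
The closed-form setting in (9.52) can be checked numerically against the predictive expected loss (9.51). The following is a minimal sketch, not the author's code: the function names and all numerical inputs are hypothetical, and `m` is assumed to be the prior precision multiplier that enters the posterior variance $\sigma^2/(m + \sum x_i^2)$.

```python
def expected_loss(x, y_target, beta_star, sigma2, sum_x2, m=0.0):
    """Predictive expected loss (9.51): predictive variance plus squared bias."""
    pred_var = sigma2 * (1.0 + x ** 2 / (sum_x2 + m))
    return pred_var + (y_target - beta_star * x) ** 2


def optimal_control_x(y_target, beta_star, sigma2, sum_x2, m=0.0):
    """Closed-form minimiser x* of the expected loss, as in (9.52)."""
    shrink = 1.0 + sigma2 / (beta_star ** 2 * (sum_x2 + m))
    return (y_target / beta_star) / shrink


# Hypothetical inputs, chosen only for illustration.
y_target, beta_star, sigma2, sum_x2, m = 10.0, 2.0, 4.0, 25.0, 1.0

x_star = optimal_control_x(y_target, beta_star, sigma2, sum_x2, m)
naive_x = y_target / beta_star  # setting that ignores uncertainty about beta

loss_star = expected_loss(x_star, y_target, beta_star, sigma2, sum_x2, m)
loss_naive = expected_loss(naive_x, y_target, beta_star, sigma2, sum_x2, m)
print(x_star, naive_x, loss_star, loss_naive)
```

The shrinkage factor pulls $x^*$ below the naive setting $y^*/\beta^*$; the smaller the posterior variance of $\beta$, the closer the two become.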
9.8 GENERAL LINEAR MODEL 
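Result 9.1. Consider the general linear model

$$y = X\beta + u, \qquad (9.54)$$

where $u \sim MVN(0, \sigma^2 I)$. If the prior distribution of $\beta$ is $g(\beta) \propto 1$, then the posterior distribution of $\beta$, given $z$, is

$$g(\beta \mid z) \propto \ell(\beta \mid z)\, g(\beta) \propto \exp\left[-\frac{1}{2\sigma^2}\left\{(\beta - \hat\beta)'X'X(\beta - \hat\beta) + (y - X\hat\beta)'(y - X\hat\beta)\right\}\right]. \qquad (9.55)$$

Since the second term in the exponential is independent of $\beta$, the posterior distribution of $\beta$ is $MVN\left(\hat\beta,\ \sigma^2 (X'X)^{-1}\right)$, where $\hat\beta = (X'X)^{-1}X'y$ is the least squares estimator of $\beta$.

In case $\sigma$ is also unknown and the prior distribution of $(\beta, \sigma)$ is non-informative, such that $g(\beta, \sigma) \propto 1/\sigma$, then

$$g(\beta, \sigma \mid z) \propto \sigma^{-(n+1)} \exp\left[-\frac{1}{2\sigma^2}\left\{(y - X\hat\beta)'(y - X\hat\beta) + (\beta - \hat\beta)'X'X(\beta - \hat\beta)\right\}\right]. \qquad (9.56)$$

Since

$$g(\beta, \sigma \mid z) = g(\beta \mid \sigma, z)\, g(\sigma \mid z),$$

we have the marginal posterior distribution of $\sigma$ as Inverted-Gamma$\left(n - k,\ (y - X\hat\beta)'(y - X\hat\beta)/2\right)$, and the posterior conditional distribution of $\beta$, given $\sigma$, is $MVN\left(\hat\beta,\ \sigma^2 (X'X)^{-1}\right)$.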


On integrating out $\sigma$ from $g(\beta, \sigma \mid y)$, we obtain the marginal posterior density of $\beta$ as a multivariate t-density with $(n-k)$ df, mean $\hat\beta$, and precision matrix $(n-k)(X'X) \big/ \left[(y - X\hat\beta)'(y - X\hat\beta)\right]$.

Note that the covariance matrix is $(y - X\hat\beta)'(y - X\hat\beta)\,(X'X)^{-1} / (n-k-2)$.

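Result 9.2. Consider the model $y = X\beta + u$, where $u \sim MVN(0, \sigma^2 I)$ with $\sigma^2$ known, and the prior distribution of $\beta$ is $MVN(\beta_0, V)$. Then the posterior distribution of $\beta$ is $MVN(\beta^*, V^*)$, where

$$\beta^* = \left[V^{-1} + \frac{X'X}{\sigma^2}\right]^{-1}\left[V^{-1}\beta_0 + \frac{X'X\hat\beta}{\sigma^2}\right] \quad \text{and} \quad V^* = \left(V^{-1} + \frac{X'X}{\sigma^2}\right)^{-1}, \qquad (9.57)$$

with $\hat\beta = (X'X)^{-1}X'y$.

In case $\sigma^2$ is also unknown and we take a multivariate normal inverted-gamma prior for $(\beta, \sigma)$, such that the conditional prior distribution of $\beta$, given $\sigma$, is $MVN(\beta_0, \sigma^2 V^{-1})$ and the marginal prior distribution of $\sigma$ is Inverted-Gamma$(v, vs^2/2)$, then the conditional posterior distribution of $\beta$, given $\sigma$, is $MVN\left(\beta^*, \sigma^2 (V + X'X)^{-1}\right)$, the marginal posterior distribution of $\sigma$ is Inverted-Gamma$(v^*, v^* s^{*2})$, and the marginal posterior distribution of $\beta$ is a multivariate t-density with $v^*$ df, mean $\beta^*$ and scale parameter $(V + X'X)/s^{*2}$, where

$$v^* = n + v,$$

$$\beta^* = (V + X'X)^{-1}\left(V\beta_0 + X'X\hat\beta\right), \qquad (9.58)$$

and

$$v^* s^{*2} = v s^2 + y'y + \beta_0' V \beta_0 - \beta^{*\prime}(V + X'X)\beta^*.$$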
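Remark 9.20. The marginal posterior distribution of the jth component $\beta_j$ of $\beta$ is a univariate t-distribution with df $v^*$, mean $\beta_j^*$ (the jth component of $\beta^*$), and variance $\sigma_{jj}$, where $\sigma_{jj}$ is the jth diagonal element of the covariance matrix

$$(V + X'X)^{-1}\left[v s^2 + y'y + \beta_0' V \beta_0 - \beta^{*\prime}(V + X'X)\beta^*\right] \big/ (n + v - 2). \qquad (9.59)$$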


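Remark 9.21. In particular, for a simple regression model

$$y_i = \beta_1 + \beta_2 x_i + u_i, \quad i = 1, 2, \ldots, n,$$

where the $u_i$ are iid $N(0, \sigma^2)$, let us take the non-informative prior for $\beta_1$, $\beta_2$, and $\sigma$ as $g(\beta_1, \beta_2, \sigma) \propto 1/\sigma$. Then the posterior distribution of $(\beta_1, \beta_2, \sigma)$, given data, is such that the conditional posterior distribution of $(\beta_1, \beta_2)$, given $\sigma$, is bivariate normal with mean vector $(\hat\beta_1, \hat\beta_2)$ and covariance matrix $\Sigma$, and the marginal posterior distribution of $\sigma$ is Inverted-Gamma$\left((n-2), (n-2)s^2\right)$, where

$$\Sigma = \sigma^2 (X'X)^{-1} = \sigma^2 \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}^{-1}$$

and

$$s^2 = \frac{1}{n-2}\sum_{i=1}^{n}\left(y_i - \hat\beta_1 - \hat\beta_2 x_i\right)^2.$$

The posterior distribution of $(\beta_1, \beta_2)$ is a bivariate t-density with $(n-2)$ df, mean vector $(\hat\beta_1, \hat\beta_2)$ and covariance matrix $s^2 (X'X)^{-1}$. Hence, the marginal posterior distribution of $\beta_1$, given $z$, is

$$g(\beta_1 \mid z) \propto \left[(n-2) + \frac{\sum_{i=1}^{n}(x_i - \bar x)^2}{s^2 \sum_{i=1}^{n} x_i^2 / n}\,(\beta_1 - \hat\beta_1)^2\right]^{-(n-1)/2}, \qquad (9.60)$$

and the marginal posterior distribution of $\beta_2$, given $z$, is

$$g(\beta_2 \mid z) \propto \left[(n-2) + \frac{\sum_{i=1}^{n}(x_i - \bar x)^2}{s^2}\,(\beta_2 - \hat\beta_2)^2\right]^{-(n-1)/2}. \qquad (9.61)$$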


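In other words, each of the following two standardized random parameters

$$\left[\frac{n\sum_{i=1}^{n}(x_i - \bar x)^2}{s^2 \sum_{i=1}^{n} x_i^2}\right]^{1/2}(\beta_1 - \hat\beta_1) \quad \text{and} \quad \left[\frac{\sum_{i=1}^{n}(x_i - \bar x)^2}{s^2}\right]^{1/2}(\beta_2 - \hat\beta_2)$$

has a univariate Student's t-distribution with $(n-2)$ df.

The $(1-\alpha)100\%$ HPD credible interval for $\beta_2$ will be given by

$$\left[\hat\beta_2 - t_{(\alpha/2),\,n-2}\left[\frac{\sum (x_i - \bar x)^2}{s^2}\right]^{-1/2},\ \ \hat\beta_2 + t_{(\alpha/2),\,n-2}\left[\frac{\sum (x_i - \bar x)^2}{s^2}\right]^{-1/2}\right],$$

where $t_{(\alpha/2),\,n-2}$ is the $(\alpha/2)$th fractile of the Student's t-distribution with $(n-2)$ df.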


Remark 9.22. Consider a regression model with two explanatory variables, as in Remark 9.21, which suffers from a multicollinearity problem, so that the OLSEs of $\beta_1$ and $\beta_2$ have very large variances. If we choose a prior with diagonal covariance matrix, then the covariance matrix of the posterior distribution is

$$\frac{(V + X'X)^{-1}\left[v s^2 + y'y + \beta_0' V \beta_0 - \beta^{*\prime}(V + X'X)\beta^*\right]}{n + v - 2},$$

which will not be singular because of the factor $V + X'X$ instead of $X'X$. Therefore, a proper choice of the prior covariance matrix will solve the multicollinearity problem. In fact, the classical solution to the multicollinearity problem, suggested by Hoerl and Kennard, is similar to the posterior mean of $\beta$ and is known as the ridge estimator.
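
The link with the ridge estimator can be seen by computing the posterior mean $(V + X'X)^{-1}(V\beta_0 + X'y)$ for the special choice $V = kI$ and $\beta_0 = 0$, which gives $(X'X + kI)^{-1}X'y$. The sketch below is illustrative only: it assumes two regressors without intercept, the data and the value of `k` are made up, and the 2x2 system is solved by hand to keep the example dependency-free.

```python
import math


def solve_2x2(a, b):
    """Solve the 2x2 linear system a z = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    z1 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    z2 = (a[0][0] * b[1] - a[1][0] * b[0]) / det
    return [z1, z2]


def posterior_mean(x1, x2, y, k, beta0=(0.0, 0.0)):
    """Posterior mean (V + X'X)^{-1} (V beta0 + X'y) with V = k I (assumed prior)."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    xy1 = sum(a * c for a, c in zip(x1, y))
    xy2 = sum(b * c for b, c in zip(x2, y))
    lhs = [[s11 + k, s12], [s12, s22 + k]]
    rhs = [k * beta0[0] + xy1, k * beta0[1] + xy2]
    return lhs, rhs, solve_2x2(lhs, rhs)


# Hypothetical, nearly collinear regressors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.01, 1.99, 3.02, 3.98, 5.01]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

_, _, ols = posterior_mean(x1, x2, y, k=0.0)      # k = 0 recovers least squares
lhs, rhs, ridge = posterior_mean(x1, x2, y, k=1.0)  # k > 0 gives the ridge estimate

norm_ols = math.hypot(ols[0], ols[1])
norm_ridge = math.hypot(ridge[0], ridge[1])
print(ols, ridge, norm_ols, norm_ridge)
```

With $k > 0$ the matrix $V + X'X$ is well conditioned even though $X'X$ is nearly singular, and the estimate is shrunk towards $\beta_0$.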
Result 9.3. Consider the general linear model

$$y = X\beta + u,$$

with known common error variance $\sigma^2$, and suppose that the prior distribution for $\beta$ is non-informative. We know that the posterior distribution of $\beta$ is $MVN\left(\hat\beta,\ \sigma^2 (X'X)^{-1}\right)$. The linex estimate of a linear combination of regression coefficients $\delta = \sum_{i=1}^{k} b_i \beta_i = b'\beta$ will be $\hat\delta = b'\hat\beta - a\, b'(X'X)^{-1} b\, \sigma^2 / 2$, since the posterior distribution of $\delta$ is the univariate normal $N\left(b'\hat\beta,\ b'(X'X)^{-1} b\, \sigma^2\right)$.


Large Sample Approximations


Asymptotic normality of the posterior distribution is the basic tool of large sample Bayesian inference. Under some regularity conditions, in particular if the likelihood is a continuous function of $\theta$ and the maximum likelihood estimate $\hat\theta$ of $\theta$ is not on the boundary of the parameter space, the unimodal and almost symmetric posterior distribution of $\theta$ approaches normality with mean $\hat\theta$ and precision $I(\hat\theta)$, the Fisher information evaluated at $\hat\theta$, for large sample sizes. It may also be noted that for large samples the likelihood dominates the prior distribution and, therefore, knowledge of the likelihood is enough to obtain the normal approximation. Gelman et al. (1995) give a number of counterexamples to illustrate limitations of the large sample approximation to the posterior distribution.

The Bayesian approach to parametric inference is conceptually simple and probabilistically 
elegant. However, its numerical implementation is not convenient since the posterior distributions are 
available as complicated functions. In Section 10.2, we illustrate some of the well-known methods to 
obtain Bayes point estimates and related Bayes risks. 


10.1 NORMAL APPROXIMATION TO POSTERIOR DISTRIBUTION 


The numerical implementation of a Bayesian procedure is not always straightforward since the 
involved posterior distributions are complicated functions. One of the important steps in simplifying 
the computations is to investigate large sample behaviour of the posterior distribution and its 
characteristics. The basic result of large sample Bayesian inference is that the posterior distribution of 
the parameter approaches a normal distribution. If the likelihood function happens to be correct then 
the limiting posterior distribution should be centered at the true value of the parameter. 

Result 10.1. Let $X_1, X_2, \ldots, X_n$ be n independent observations from a sampling distribution with joint density $f(x_1, \ldots, x_n \mid \theta)$. Suppose $g(\theta)$ is a prior density for $\theta$ which is positive for all $\theta \in \Theta$. Under suitable regularity conditions, the limiting posterior distribution of $(\theta - \hat\theta)/\hat\sigma$ is standard normal for large sample size, where $\hat\theta$ denotes the unique maximum likelihood estimate of $\theta$ and

$$\frac{1}{\hat\sigma^2} = -\left[\frac{\partial^2 \log f(x_1, \ldots, x_n \mid \theta)}{\partial \theta^2}\right]_{\theta = \hat\theta}. \qquad (10.1)$$


It is interesting to note that the large sample posterior distribution of $\theta$ does not depend on the prior. In other words, in large samples the data totally dominate the prior beliefs.

Proof. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample of n independent observations having a likelihood function $\ell(\theta \mid x)$. Assume that the prior $g(\theta)$ and the likelihood function $\ell(\theta \mid x)$ are positive for all $\theta \in \Theta$ and have continuous derivatives. Further assume that $\hat\theta$ is the unique maximum likelihood estimate of $\theta$.

Let us write $g(\theta \mid x) \propto g(\theta) \exp\left(\log \ell(\theta \mid x)\right)$. Expanding $g(\theta)$ by Taylor's expansion in the neighbourhood of $\theta = \hat\theta$, we have

$$g(\theta) = g(\hat\theta)\left[1 + (\theta - \hat\theta)\frac{g'(\hat\theta)}{g(\hat\theta)} + \frac{1}{2}(\theta - \hat\theta)^2 \frac{g''(\hat\theta)}{g(\hat\theta)} + \cdots\right]. \qquad (10.2)$$
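Denote $L(\theta) = \log \ell(\theta \mid x)$ and expand $L(\theta)$ by Taylor's expansion in the neighbourhood of $\theta = \hat\theta$ to have

$$L(\theta) = L(\hat\theta) + (\theta - \hat\theta) L'(\hat\theta) + \frac{1}{2}(\theta - \hat\theta)^2 L''(\hat\theta) + \frac{1}{6}(\theta - \hat\theta)^3 L'''(\hat\theta) + \cdots. \qquad (10.3)$$

Since $\hat\theta$ is the maximum likelihood estimate of $\theta$, $L'(\hat\theta) = 0$, and $L(\hat\theta)$ is a function of x only (independent of $\theta$), we have

$$\exp(L(\theta)) \propto \exp\left[\frac{1}{2}(\theta - \hat\theta)^2 L''(\hat\theta)\right] \exp\left[\frac{1}{6}(\theta - \hat\theta)^3 L'''(\hat\theta) + \cdots\right]$$

$$\propto \exp\left[\frac{1}{2}(\theta - \hat\theta)^2 L''(\hat\theta)\right]\left[1 + \frac{1}{6}(\theta - \hat\theta)^3 L'''(\hat\theta) + \cdots\right], \qquad (10.4)$$

on neglecting the higher order terms in the expansion of the second exponential term. Thus, $g(\theta \mid x)$ is obtained by multiplying (10.2) and (10.4):

$$g(\theta \mid x) \propto \exp\left[\frac{1}{2}(\theta - \hat\theta)^2 L''(\hat\theta)\right]\left[1 + (\theta - \hat\theta)\frac{g'(\hat\theta)}{g(\hat\theta)} + \frac{1}{2}(\theta - \hat\theta)^2 \frac{g''(\hat\theta)}{g(\hat\theta)} + \frac{1}{6}(\theta - \hat\theta)^3 L'''(\hat\theta) + \cdots\right].$$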


According to Jeffreys (1961), $(\theta - \hat\theta)$ is of order $1/\sqrt{n}$ and $L(\theta)$ is of order n, so that the terms $(\theta - \hat\theta)\, g'(\hat\theta)/g(\hat\theta)$ and $(\theta - \hat\theta)^3 L'''(\hat\theta)/6$ are of order $1/\sqrt{n}$, whereas $(\theta - \hat\theta)^2 g''(\hat\theta)/2g(\hat\theta)$ is of order $1/n$. Hence, the large sample approximation of $g(\theta \mid x)$ is given by

$$g(\theta \mid x) \propto \exp\left[\frac{1}{2}(\theta - \hat\theta)^2 L''(\hat\theta)\right] \qquad (10.5)$$

and involves an error of order $1/\sqrt{n}$. Thus, the large sample approximation of the posterior distribution is the normal distribution with mean $\hat\theta$ and variance $\left[-\partial^2 L(\theta)/\partial \theta^2\right]^{-1}$ evaluated at $\theta = \hat\theta$.

Remark 10.1. In case the population density is such that the range of the random variable X depends upon $\theta$, the limiting posterior normality cannot be obtained. For example, if $X \sim U(0, \theta)$, the range of X is $(0, \theta)$ and the maximum likelihood estimate of $\theta$ is $\hat\theta = \max(X_1, X_2, \ldots, X_n)$. The likelihood function is a monotonic function and, therefore, $\ell(\theta \mid x)$ is not differentiable at $\hat\theta$.

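Example 10.1. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from the Poisson distribution with unknown parameter $\theta$. Since

$$L(\theta) = \text{Constant} - n\theta + \sum_{i=1}^{n} x_i \log \theta,$$

$$L'(\theta) = -n + \sum_{i=1}^{n} x_i / \theta, \quad \text{and} \quad L''(\theta) = -\sum_{i=1}^{n} x_i / \theta^2,$$

so that $\hat\theta = \sum_{i=1}^{n} x_i / n$. Therefore, the large sample approximate posterior distribution is

$$N\left(\frac{1}{n}\sum_{i=1}^{n} x_i,\ \frac{1}{n^2}\sum_{i=1}^{n} x_i\right).$$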
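
To see how close this approximation is for a moderate sample, one can compare its mean and variance with those of an exact posterior. The sketch below assumes, purely for illustration, the flat prior $g(\theta) \propto 1$, under which the exact posterior is Gamma with shape $\sum x_i + 1$ and rate n; the data and function names are hypothetical.

```python
def poisson_normal_approx(xs):
    """Mean and variance of the large sample normal approximation."""
    n = len(xs)
    total = sum(xs)
    return total / n, total / n ** 2


def poisson_flat_prior_exact(xs):
    """Exact posterior mean and variance under the assumed flat prior.

    With g(theta) proportional to 1, the posterior is Gamma(sum(xs) + 1, rate n).
    """
    n = len(xs)
    total = sum(xs)
    return (total + 1) / n, (total + 1) / n ** 2


# Hypothetical Poisson counts.
xs = [3, 5, 4, 6, 2, 4, 5, 3, 4, 4]

approx_mean, approx_var = poisson_normal_approx(xs)
exact_mean, exact_var = poisson_flat_prior_exact(xs)
print(approx_mean, approx_var)
print(exact_mean, exact_var)
```

The two sets of moments differ by terms of order $1/n$ and $1/n^2$ respectively, which is consistent with the prior being dominated by the likelihood as n grows.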


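Example 10.2. Suppose $X_1, X_2, \ldots, X_n$ are n iid Bernoulli random variables with unknown probability of success $\theta$. The log likelihood function of $\theta$ is

$$L(\theta) = \sum_{i=1}^{n} x_i \log \theta + \left(n - \sum_{i=1}^{n} x_i\right)\log(1 - \theta).$$

Since

$$L'(\theta) = \frac{1}{\theta(1-\theta)}\left[\sum_{i=1}^{n} x_i - n\theta\right],$$

the maximum likelihood estimate of $\theta$ is $\hat\theta = \sum_{i=1}^{n} x_i / n = \bar x$ and $L''(\hat\theta) = -n / \left(\hat\theta(1 - \hat\theta)\right)$.

Thus, the asymptotic posterior distribution of $\theta$, given x, is $N\left(\bar x,\ \bar x(1 - \bar x)/n\right)$. On the other hand, the asymptotic distribution of the statistic $\sum_{i=1}^{n} x_i / n$, using the central limit theorem, is $N\left(\theta,\ \theta(1-\theta)/n\right)$. It is so because $-L''(\hat\theta)$ converges to $nI(\theta \mid x)$ and $\hat\theta = \bar X$ converges to $\theta$ for large n. (See Bernardo and Smith (1994), page 486.)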
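Example 10.3. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from Pareto$(\alpha, A)$, A known, with pdf

$$f(x \mid \alpha) = \frac{\alpha A^{\alpha}}{x^{\alpha + 1}}, \quad x > A.$$

Therefore,

$$\ell(\alpha \mid x) = \frac{\alpha^n}{\prod_{i=1}^{n} x_i} \exp\left\{-\alpha \sum_{i=1}^{n} \log(x_i / A)\right\},$$

$$L(\alpha) = n \log \alpha - \alpha \sum_{i=1}^{n} \log(x_i / A) + \text{Constant},$$

$$L'(\alpha) = \frac{n}{\alpha} - \sum_{i=1}^{n} \log(x_i / A),$$

$$L''(\alpha) = -\frac{n}{\alpha^2}.$$

Hence, the maximum likelihood estimate of $\alpha$ is

$$\hat\alpha = \frac{n}{\sum_{i=1}^{n} \log(x_i / A)} \quad \text{and} \quad \left(-L''(\alpha)\right)^{-1}\Big|_{\alpha = \hat\alpha} = \hat\alpha^2 / n.$$

Hence, the large sample posterior pdf for $\alpha$ is $N\left(\hat\alpha,\ \hat\alpha^2 / n\right)$.

Remark 10.2. In the light of Remark 10.1, the limiting posterior distribution for A, given $\alpha$, cannot be obtained.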


Remark 10.3. In most cases, $L''(\theta)$ does not differ much from its expectation over X and we have $E(L''(\theta)) = -I(\theta \mid x)$, where $I(\theta \mid x)$ is Fisher's information. However, $I(\theta \mid x)$ depends on the distribution of the random variable X rather than on the observed value x. We may, therefore, replace $L''(\hat\theta)$ by $-I(\hat\theta \mid x)$, which depends on x. Hence the approximate posterior distribution of $\theta$ may be considered as $N\left(\hat\theta,\ 1 / I(\hat\theta \mid x)\right)$.


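Example 10.4. (Lee, 1997) Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from the Cauchy distribution with unknown location parameter $\theta$, so that

$$f(x_i \mid \theta) = \frac{1}{\pi\left[1 + (x_i - \theta)^2\right]}, \quad i = 1, 2, \ldots, n.$$

Since

$$L(\theta) = -n \log \pi - \sum_{i=1}^{n} \log\left(1 + (x_i - \theta)^2\right),$$

$$L'(\theta) = 2\sum_{i=1}^{n} \frac{x_i - \theta}{1 + (x_i - \theta)^2},$$

and

$$L''(\theta) = 2\sum_{i=1}^{n} \frac{(x_i - \theta)^2 - 1}{\left(1 + (x_i - \theta)^2\right)^2},$$

therefore,

$$I(\theta \mid x) = -E(L''(\theta)) = -2nE\left[\frac{(x - \theta)^2 - 1}{\left(1 + (x - \theta)^2\right)^2}\right] = -\frac{4n}{\pi}\int_{\theta}^{\infty} \frac{(x - \theta)^2 - 1}{\left(1 + (x - \theta)^2\right)^3}\, dx.$$

Putting $x - \theta = z = \tan\psi$,

$$\int_{0}^{\infty} \frac{z^2 - 1}{(1 + z^2)^3}\, dz = \int_{0}^{\pi/2} \sin^2\psi \cos^2\psi \, d\psi - \int_{0}^{\pi/2} \cos^4\psi \, d\psi = \frac{\pi}{16} - \frac{3\pi}{16} = -\frac{\pi}{8},$$

so that $I(\theta \mid x) = n/2$. Hence, the large sample approximation of the posterior distribution is $N(\hat\theta, 2/n)$.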


Remark 10.4. In order to obtain the mle for $\theta$, we may either use the method of scoring,

$$\theta_{k+1} = \theta_k + L'(\theta_k) / I(\theta_k \mid x),$$

or the Newton-Raphson method,

$$\theta_{k+1} = \theta_k - L'(\theta_k) / L''(\theta_k),$$

where the starting value $\theta_0$ may be taken as the sample median.
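
For the Cauchy location model of Example 10.4, where $I(\theta \mid x) = n/2$, the scoring iteration takes a particularly simple form. The following is a minimal sketch under that model only; the data, tolerance and iteration cap are hypothetical choices, and convergence from the median is assumed rather than guaranteed for arbitrary samples.

```python
import statistics


def cauchy_score(theta, xs):
    """First derivative L'(theta) of the Cauchy log likelihood."""
    return 2.0 * sum((x - theta) / (1.0 + (x - theta) ** 2) for x in xs)


def cauchy_mle_scoring(xs, tol=1e-10, max_iter=200):
    """Method of scoring: theta <- theta + L'(theta) / I, with I = n / 2."""
    info = len(xs) / 2.0
    theta = statistics.median(xs)  # starting value: the sample median
    for _ in range(max_iter):
        step = cauchy_score(theta, xs) / info
        theta += step
        if abs(step) < tol:
            break
    return theta


# Hypothetical sample with one wild observation.
xs = [-1.2, 0.3, 0.8, 1.1, 1.9, 2.4, 15.0]

theta_hat = cauchy_mle_scoring(xs)
approx_posterior_var = 2.0 / len(xs)  # variance of the N(theta_hat, 2/n) approximation
print(theta_hat, approx_posterior_var)
```

Scoring avoids evaluating $L''$ at every step, which can change sign for the Cauchy likelihood; the Newton-Raphson update would replace `info` by `-L''(theta)`.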


Remark 10.5. The normal approximation of the posterior distribution may be extended to the case of more than one parameter. In particular, let $\theta_1$ and $\theta_2$ be two unknown parameters having maximum likelihood estimates $\hat\theta_1$ and $\hat\theta_2$, respectively.

Denote $A = (A_{ij})$, where $A_{ij} = \dfrac{\partial^2 L(\theta_1, \theta_2)}{\partial\theta_i\, \partial\theta_j}$, $i, j = 1, 2$. The approximate posterior distribution of $(\theta_1, \theta_2)$ is the bivariate normal distribution with mean vector $(\hat\theta_1, \hat\theta_2)$ and variance covariance matrix $-A^{-1}$ evaluated at $(\hat\theta_1, \hat\theta_2)$.
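Example 10.5. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from $N(\mu, \sigma^2)$. Here $\theta = (\mu, \sigma^2)$, where both $\mu$ and $\sigma^2$ are unknown. In this case

$$L(\mu, \sigma^2) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 + \text{Constant}$$

$$= -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\left[\sum_{i=1}^{n}(x_i - \bar x)^2 + n(\mu - \bar x)^2\right] + \text{Constant},$$

$$\frac{\partial L}{\partial \mu} = -\frac{n(\mu - \bar x)}{\sigma^2},$$

$$\frac{\partial L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\left[\sum_{i=1}^{n}(x_i - \bar x)^2 + n(\mu - \bar x)^2\right].$$

Hence, the maximum likelihood estimate of $\theta$ is

$$\hat\theta = (\hat\mu, \hat\sigma^2) = \left(\bar x,\ \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)^2\right).$$

In order to work out the A matrix, we note that

$$\frac{\partial^2 L}{\partial \mu^2} = -\frac{n}{\sigma^2}, \qquad \frac{\partial^2 L}{\partial (\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\left[\sum_{i=1}^{n}(x_i - \bar x)^2 + n(\mu - \bar x)^2\right],$$

and

$$\frac{\partial^2 L}{\partial \mu\, \partial \sigma^2} = \frac{n(\mu - \bar x)}{(\sigma^2)^2}.$$

Since A evaluated at $(\hat\mu, \hat\sigma^2)$ is

$$\begin{pmatrix} -n/\hat\sigma^2 & 0 \\ 0 & -n/2(\hat\sigma^2)^2 \end{pmatrix},$$

the large sample approximation to the posterior distribution for $(\mu, \sigma^2)$ is bivariate normal with mean $\left(\bar x,\ \sum_{i=1}^{n}(x_i - \bar x)^2 / n\right)$ and covariance matrix

$$\begin{pmatrix} \sum_{i=1}^{n}(x_i - \bar x)^2 / n^2 & 0 \\ 0 & 2\left\{\sum_{i=1}^{n}(x_i - \bar x)^2\right\}^2 / n^3 \end{pmatrix}.$$

We note that the off-diagonal elements of the variance covariance matrix are zero and, therefore, the marginal posterior distributions for $\mu$ and $\sigma^2$ are independent

$$N\left(\bar x,\ \frac{1}{n^2}\sum_{i=1}^{n}(x_i - \bar x)^2\right) \quad \text{and} \quad N\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)^2,\ \frac{2}{n^3}\left\{\sum_{i=1}^{n}(x_i - \bar x)^2\right\}^2\right),$$

respectively.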
nN j=1 nN \ i=l 


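Remark 10.6. Note that if $\mu$ is known, the maximum likelihood estimate of $\sigma^2$ is $\sum_{i=1}^{n}(x_i - \mu)^2 / n$ and the approximate posterior distribution for $\sigma^2$ is

$$N\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2,\ \frac{2}{n^3}\left\{\sum_{i=1}^{n}(x_i - \mu)^2\right\}^2\right).$$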


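Remark 10.7. The $100(1-\alpha)\%$ approximate HPD credible interval of the scalar parameter $\theta$ based on the normal approximation of the posterior density is

$$\left[\hat\theta - z_{1-\alpha/2}\left[-\frac{\partial^2 L(\theta)}{\partial\theta^2}\right]^{-1/2}_{\theta = \hat\theta},\ \ \hat\theta + z_{1-\alpha/2}\left[-\frac{\partial^2 L(\theta)}{\partial\theta^2}\right]^{-1/2}_{\theta = \hat\theta}\right],$$

where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$th fractile of the standard normal distribution.

Thus, for a sample x of size n from a Pois$(\theta)$ population, the 95% approximate HPD credible interval for $\theta$ is $\left[\bar x - 1.96\sqrt{\bar x / n},\ \bar x + 1.96\sqrt{\bar x / n}\right]$.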
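
The Poisson interval of Remark 10.7 is straightforward to compute. The sketch below uses the standard library's `statistics.NormalDist` for the normal fractile; the function name and the data are hypothetical.

```python
import math
from statistics import NormalDist


def poisson_approx_hpd(xs, alpha=0.05):
    """Approximate 100(1 - alpha)% HPD interval for a Poisson mean.

    Uses the normal approximation N(xbar, xbar / n) to the posterior.
    """
    n = len(xs)
    xbar = sum(xs) / n
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    half_width = z * math.sqrt(xbar / n)
    return xbar - half_width, xbar + half_width


# Hypothetical Poisson counts.
xs = [3, 5, 4, 6, 2, 4, 5, 3, 4, 4]

lower, upper = poisson_approx_hpd(xs)
print(lower, upper)
```

Because the normal density is symmetric and unimodal, the equal-tailed interval computed here coincides with the HPD interval of the approximating distribution.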


10.2. APPROXIMATION TO POSTERIOR MOMENTS 


Quite often, the integrals appearing in Bayes estimation cannot be expressed in closed form when the chosen prior distributions are not conjugate. In particular, evaluating the posterior expected value of $h(\theta)$ involves the ratio of the integrals $\int_{\Theta} h(\theta)\,\ell(\theta \mid x)\, g(\theta)\, d\theta$ and $\int_{\Theta} \ell(\theta \mid x)\, g(\theta)\, d\theta$, which is nothing but $E(h(\theta) \mid x)$. Lindley (1980) considered the evaluation of this ratio. Later, Tierney and Kadane (1986) gave another analytical approximation for its evaluation. In this section, we shall illustrate the use of these two methods for approximating the posterior moments.


Lindley’s Approximation 


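Let us consider the case of a scalar parameter $\theta$ of the distribution having pdf (pmf) $f(x \mid \theta)$. Suppose the likelihood function has a unique maximum at $\hat\theta$, the maximum likelihood estimate of $\theta$. Let us denote

$$h_i(\theta) = \frac{\partial^i h(\theta)}{\partial\theta^i}, \qquad L_i = \frac{\partial^i L}{\partial\theta^i}, \quad i = 1, 2, \ldots,$$

and

$$\hat\sigma^2 = \left(-\frac{\partial^2 L}{\partial\theta^2}\right)^{-1}\Bigg|_{\theta = \hat\theta}, \qquad u(\theta) = \log g(\theta), \qquad u_1(\theta) = \frac{\partial u(\theta)}{\partial\theta}.$$

Then,

$$E(h(\theta) \mid x) = \frac{\int_{\Theta} h(\theta)\,\ell(\theta \mid x)\, g(\theta)\, d\theta}{\int_{\Theta} \ell(\theta \mid x)\, g(\theta)\, d\theta} = \frac{\int_{\Theta} h(\theta) \exp\left(L(\theta) + u(\theta)\right) d\theta}{\int_{\Theta} \exp\left(L(\theta) + u(\theta)\right) d\theta},$$

where

$$L(\theta) = \log \ell(\theta \mid x) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$

and

$$u(\theta) = \log g(\theta).$$

Then, Lindley's approximation, for large n, of $E(h(\theta) \mid x)$ is given by

$$E(h(\theta) \mid x) \approx h(\hat\theta) + \frac{1}{2}\left[h_2(\hat\theta) + 2h_1(\hat\theta)\, u_1(\hat\theta)\right]\hat\sigma^2 + \frac{1}{2}\left[L_3(\hat\theta)\, h_1(\hat\theta)\right]\hat\sigma^4. \qquad (10.6)$$

In particular, if $h(\theta) = \theta$,

$$E(\theta \mid x) \approx \hat\theta + u_1(\hat\theta)\,\hat\sigma^2 + \frac{1}{2} L_3(\hat\theta)\,\hat\sigma^4. \qquad (10.7)$$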


Remark 10.8. The approximation is of $O(1/n)$ and the first term neglected is of $O(1/n^2)$.

Remark 10.9. The result holds for proper as well as improper priors as long as posterior distributions 
are proper and posterior expectations exist. It is so because in the case of improper prior, the integral 
in the denominator is a normalising constant. 


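Example 10.6. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from $f(x \mid \theta) = \theta^x (1 - \theta)^{1-x}$; $x = 0, 1$, and suppose our prior for $\theta$ is Jeffreys' non-informative prior $g(\theta) = c\,\theta^{-1/2}(1 - \theta)^{-1/2}$. We are interested in obtaining an approximation for $E(\theta \mid x)$. We have

$$h(\theta) = \theta,$$

$$u(\theta) = -\frac{1}{2}\left(\log\theta + \log(1 - \theta)\right) + \log c,$$

$$\frac{\partial u(\theta)}{\partial\theta} = -\frac{1}{2\theta} + \frac{1}{2(1-\theta)} = \frac{2\theta - 1}{2\theta(1 - \theta)},$$

$$\hat\theta = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar x, \quad \text{and} \quad \hat\sigma^2 = \frac{\bar x(1 - \bar x)}{n}.$$

Using (10.7), with $L_3(\hat\theta) = 2n(1 - 2\bar x) \big/ \left(\bar x^2 (1 - \bar x)^2\right)$, we get

$$E(\theta \mid x) \approx \bar x + \frac{2\bar x - 1}{2\bar x(1 - \bar x)} \cdot \frac{\bar x(1 - \bar x)}{n} + \frac{1}{2} \cdot \frac{2n(1 - 2\bar x)}{\bar x^2(1 - \bar x)^2} \cdot \frac{\bar x^2(1 - \bar x)^2}{n^2} = \bar x - \frac{1}{n}\left(\bar x - \frac{1}{2}\right).$$

From Remark 10.8,

$$E(\theta \mid x) = \bar x - \frac{1}{n}\left(\bar x - \frac{1}{2}\right) + O\left(\frac{1}{n^2}\right).$$

The Bayes estimate of $\theta$, under SELF, with respect to Zellner's MDIP for the parameter $\theta$ can be obtained as follows. Since

$$g(\theta) = c\,\theta^{\theta}(1 - \theta)^{1 - \theta},$$

$$u(\theta) = \log c + \theta\log\theta + (1 - \theta)\log(1 - \theta),$$

$$u_1(\theta) = \log\theta - \log(1 - \theta).$$

Hence

$$E(\theta \mid x) = \hat\theta + \left(\log\hat\theta - \log(1 - \hat\theta)\right)\frac{\hat\theta(1 - \hat\theta)}{n} + \frac{1}{2} \cdot \frac{2n(1 - 2\hat\theta)}{\hat\theta^2(1 - \hat\theta)^2} \cdot \frac{\hat\theta^2(1 - \hat\theta)^2}{n^2} + O\left(\frac{1}{n^2}\right)$$

$$= \bar x + \frac{1}{n}\left[\bar x(1 - \bar x)\left(\log\bar x - \log(1 - \bar x)\right) + (1 - 2\bar x)\right] + O\left(\frac{1}{n^2}\right).$$

Remark 10.10. The first term is of $O(1)$, the second term is of $O(1/n)$ and the correction term is of $O(1/n^2)$.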
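
For the Jeffreys prior case of Example 10.6 the exact posterior is Beta$(\sum x_i + 1/2,\ n - \sum x_i + 1/2)$, a standard conjugate result, so the quality of Lindley's approximation (10.7) can be checked directly. The sketch below is illustrative; the function names and the data are hypothetical, and it assumes $0 < \bar x < 1$.

```python
def lindley_bernoulli_jeffreys(xs):
    """Lindley approximation (10.7) to E(theta | x) under the Jeffreys prior."""
    n = len(xs)
    xbar = sum(xs) / n  # requires 0 < xbar < 1
    u1 = (2.0 * xbar - 1.0) / (2.0 * xbar * (1.0 - xbar))
    sigma2 = xbar * (1.0 - xbar) / n
    l3 = 2.0 * n * (1.0 - 2.0 * xbar) / (xbar ** 2 * (1.0 - xbar) ** 2)
    return xbar + u1 * sigma2 + 0.5 * l3 * sigma2 ** 2


def exact_bernoulli_jeffreys(xs):
    """Exact posterior mean of the Beta(s + 1/2, n - s + 1/2) posterior."""
    n = len(xs)
    s = sum(xs)
    return (s + 0.5) / (n + 1.0)


# Hypothetical data: 7 successes in 20 trials.
xs = [1] * 7 + [0] * 13

approx = lindley_bernoulli_jeffreys(xs)
exact = exact_bernoulli_jeffreys(xs)
print(approx, exact, abs(approx - exact))
```

The discrepancy between the two values is of order $1/n^2$, in line with Remark 10.8.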


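Example 10.7. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from the Poisson distribution with unknown parameter $\theta$ and the chosen prior distribution of $\theta$ is lognormal with density

$$g(\theta) = \frac{1}{\theta\sigma\sqrt{2\pi}}\exp\left\{-\frac{(\log\theta - \mu)^2}{2\sigma^2}\right\}, \quad \theta > 0,$$

where $(\mu, \sigma)$ are known hyperparameters. Since

$$L(\theta) = -n\theta + \sum_{i=1}^{n} x_i \log\theta + \text{Constant},$$

$L_1(\theta) = 0$ gives $\hat\theta = \bar x$,

$$L_2(\hat\theta) = -\frac{1}{\hat\theta^2}\sum_{i=1}^{n} x_i = -\frac{n}{\bar x},$$

$$\hat\sigma^2 = \frac{\bar x}{n}, \quad \text{and} \quad L_3(\hat\theta) = \frac{2}{\hat\theta^3}\sum_{i=1}^{n} x_i = \frac{2n}{\bar x^2}.$$

Further,

$$u(\theta) = \text{Constant} - \log\theta - \frac{1}{2\sigma^2}(\log\theta - \mu)^2,$$

and

$$u_1(\hat\theta) = -\frac{1}{\hat\theta} - \frac{1}{2\sigma^2} \cdot \frac{2(\log\hat\theta - \mu)}{\hat\theta} = -\frac{1}{\bar x}\left(1 + \frac{\log\bar x - \mu}{\sigma^2}\right).$$

Hence,

$$E(\theta \mid x) \approx \bar x - \frac{1}{\bar x}\left(1 + \frac{\log\bar x - \mu}{\sigma^2}\right)\frac{\bar x}{n} + \frac{n}{\bar x^2} \cdot \frac{\bar x^2}{n^2} = \bar x - \frac{\log\bar x - \mu}{n\sigma^2}.$$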

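Results 10.2. For the case of two independent parameters, Lindley's approximation for $E(h(\theta_1, \theta_2) \mid x)$ is

$$E(h(\theta_1, \theta_2) \mid x) = h(\hat\theta_1, \hat\theta_2) + \frac{1}{2}\sum_{i=1}^{2} h_{ii}(\hat\theta_1, \hat\theta_2)\,\hat\sigma_{ii} + \sum_{i=1}^{2} h_i(\hat\theta_1, \hat\theta_2)\, u_i(\hat\theta_1, \hat\theta_2)\,\hat\sigma_{ii}$$

$$+ \frac{1}{2}\Big[\hat\sigma_{11}\hat\sigma_{22}\left(h_1(\hat\theta_1, \hat\theta_2) L_{12} + h_2(\hat\theta_1, \hat\theta_2) L_{21}\right) + h_1(\hat\theta_1, \hat\theta_2)\,\hat\sigma_{11}^2 L_{30} + h_2(\hat\theta_1, \hat\theta_2)\,\hat\sigma_{22}^2 L_{03}\Big] + O\left(\frac{1}{n^2}\right), \qquad (10.8)$$

where $\hat\theta = (\hat\theta_1, \hat\theta_2)$ is the mle of $(\theta_1, \theta_2)$,

$$h_i(\theta_1, \theta_2) = \frac{\partial h(\theta_1, \theta_2)}{\partial\theta_i}, \qquad h_{ii}(\theta_1, \theta_2) = \frac{\partial^2 h(\theta_1, \theta_2)}{\partial\theta_i^2}, \qquad u_i(\theta_1, \theta_2) = \frac{\partial \log g(\theta_1, \theta_2)}{\partial\theta_i}, \qquad L_{ij} = \frac{\partial^{i+j} L}{\partial\theta_1^i\, \partial\theta_2^j},$$

the derivatives $L_{ij}$ are evaluated at $\hat\theta$, $\hat\sigma_{ij}$ is the $(i,j)$th element of the inverse of the matrix of negative second derivatives of L at $\hat\theta$, and $\hat\sigma_{ij} = 0$ for $i \ne j$, since $\theta_1$ and $\theta_2$ are independent.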
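Example 10.8. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from $N(\mu, \sigma^2)$ and the joint prior for $\mu$ and $\sigma$ is Jeffreys' non-informative prior, so that $g(\mu, \sigma) \propto 1/\sigma$. Here $\theta = (\theta_1, \theta_2) = (\mu, \sigma)$. Since

$$L(\mu, \sigma) = \sum_{i=1}^{n} \log f(x_i \mid \mu, \sigma) = \text{constant} - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2,$$

the maximum likelihood estimate of $(\mu, \sigma)$ is

$$(\hat\mu, \hat\sigma) = \left(\bar x,\ \left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)^2\right]^{1/2}\right),$$

and

$$L_{12} = \frac{6}{\sigma^4}\sum_{i=1}^{n}(x_i - \mu), \quad L_{20} = -\frac{n}{\sigma^2}, \quad L_{11} = -\frac{2}{\sigma^3}\sum_{i=1}^{n}(x_i - \mu), \quad L_{21} = \frac{2n}{\sigma^3},$$

$$L_{02} = \frac{n}{\sigma^2} - \frac{3}{\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2, \quad L_{03} = -\frac{2n}{\sigma^3} + \frac{12}{\sigma^5}\sum_{i=1}^{n}(x_i - \mu)^2, \quad L_{30} = 0.$$

Therefore, at the mle $(\hat\mu, \hat\sigma)$, we have

$$\hat\sigma_{11} = \frac{\hat\sigma^2}{n}, \quad \hat\sigma_{22} = \frac{\hat\sigma^2}{2n}, \quad L_{12} = 0, \quad L_{03} = \frac{10n}{\hat\sigma^3}, \quad L_{21} = \frac{2n}{\hat\sigma^3}, \quad L_{30} = 0.$$

With $u(\mu, \sigma) = -\log\sigma$, the approximation (10.8) becomes

$$E(h(\mu, \sigma) \mid x) = \Bigg[h(\mu, \sigma) + \frac{1}{2}\left(\frac{\partial^2 h}{\partial\mu^2}\hat\sigma_{11} + \frac{\partial^2 h}{\partial\sigma^2}\hat\sigma_{22}\right) + \frac{\partial h}{\partial\mu}\frac{\partial(-\log\sigma)}{\partial\mu}\hat\sigma_{11} + \frac{\partial h}{\partial\sigma}\frac{\partial(-\log\sigma)}{\partial\sigma}\hat\sigma_{22}$$

$$+ \frac{1}{2}\left(\hat\sigma_{11}\hat\sigma_{22}\frac{\partial h}{\partial\sigma}L_{21} + \frac{\partial h}{\partial\sigma}\hat\sigma_{22}^2 L_{03}\right)\Bigg]_{(\hat\mu, \hat\sigma)} + O\left(\frac{1}{n^2}\right).$$

Writing $h(\mu, \sigma) = \mu$, independent of $\sigma$, we have

$$E(\mu \mid x) = \bar x + O\left(\frac{1}{n^2}\right),$$

since

$$\frac{\partial^2 h}{\partial\mu^2} = \frac{\partial^2 h}{\partial\sigma^2} = \frac{\partial(-\log\sigma)}{\partial\mu} = \frac{\partial h}{\partial\sigma} = 0.$$

Writing $h(\mu, \sigma) = \sigma$, independent of $\mu$, we have

$$E(\sigma \mid x) = \hat\sigma\left[1 + \frac{5}{4n}\right] + O\left(\frac{1}{n^2}\right).$$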


Tierney and Kadane Approximation (1986) 


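We observe that Lindley's approximation requires evaluation of third order partial derivatives of the log-likelihood function, which may be very cumbersome to compute when the parameter $\theta$ is vector valued. Tierney and Kadane obtained an approximation for $E[h(\theta) \mid x]$ using Laplace's method. Let us define

$$nL(\theta) = \log \ell(\theta \mid x) + \log g(\theta),$$

and

$$nL^*(\theta) = \log \ell(\theta \mid x) + \log g(\theta) + \log h(\theta), \qquad (10.9)$$

such that $nL(\theta)$ is concentrated around a unique maximum $\hat\theta$. Expanding $L(\theta)$ and $L^*(\theta)$ in a Taylor's series about their respective maxima $\hat\theta$ (the posterior mode) and $\hat\theta^*$, we have

$$L(\theta) \approx L(\hat\theta) + (\theta - \hat\theta) L_1(\hat\theta) + \frac{(\theta - \hat\theta)^2}{2} L_2(\hat\theta),$$

and

$$L^*(\theta) \approx L^*(\hat\theta^*) + (\theta - \hat\theta^*) L_1^*(\hat\theta^*) + \frac{(\theta - \hat\theta^*)^2}{2} L_2^*(\hat\theta^*),$$

where

$$L_i(\theta) = \frac{\partial^i L(\theta)}{\partial\theta^i} \quad \text{and} \quad L_i^*(\theta) = \frac{\partial^i L^*(\theta)}{\partial\theta^i}.$$

We have $L_1(\hat\theta) = L_1^*(\hat\theta^*) = 0$ and

$$E(h(\theta) \mid x) = \int_{\Theta} \exp\left(nL^*(\theta)\right) d\theta \Big/ \int_{\Theta} \exp\left(nL(\theta)\right) d\theta. \qquad (10.10)$$

Thus, for large n and under suitable regularity conditions, the two integrands in (10.10) are kernels of normal distributions with respective means $\hat\theta^*$ and $\hat\theta$, and respective variances

$$\hat\sigma^{*2} = -\left(nL_2^*(\hat\theta^*)\right)^{-1} \quad \text{and} \quad \hat\sigma^2 = -\left(nL_2(\hat\theta)\right)^{-1},$$

and we have

$$E(h(\theta) \mid x) \approx \frac{\exp\left(nL^*(\hat\theta^*)\right)\int_{\Theta} \exp\left\{-\dfrac{(\theta - \hat\theta^*)^2}{2\hat\sigma^{*2}}\right\} d\theta}{\exp\left(nL(\hat\theta)\right)\int_{\Theta} \exp\left\{-\dfrac{(\theta - \hat\theta)^2}{2\hat\sigma^2}\right\} d\theta}.$$

The Tierney-Kadane approximation is, therefore, given by

$$E(h(\theta) \mid x) \approx \frac{\hat\sigma^*}{\hat\sigma}\exp\left\{n\left(L^*(\hat\theta^*) - L(\hat\theta)\right)\right\}. \qquad (10.11)$$

Remark 10.11. According to Tierney and Kadane, the error of this approximation is of $O(1/n^2)$, as is that of Lindley's approximation.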

Remark 10.12. This approximation holds for functions h(@) which are positive and bounded away from 
zero. Tierney, Kass, and Kadane (1988) have shown that the approximation can be modified so that it 
will hold for general h(6). 


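Example 10.9. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from the exponential distribution with unknown parameter $\theta$ with pdf

$$f(x \mid \theta) = \frac{1}{\theta} e^{-x/\theta}; \quad x, \theta > 0.$$

Suppose $g(\theta) \propto 1/\theta^2$. Since

$$nL(\theta) = -n\log\theta - \frac{\sum x_i}{\theta} - 2\log\theta, \qquad nL^*(\theta) = -n\log\theta - \frac{\sum x_i}{\theta} - 2\log\theta + \log\theta,$$

therefore,

$$\hat\theta = \frac{\sum x_i}{n + 2}, \qquad \hat\theta^* = \frac{\sum x_i}{n + 1},$$

and

$$\hat\sigma^2 = \frac{\hat\theta^2}{n + 2}, \qquad \hat\sigma^{*2} = \frac{\hat\theta^{*2}}{n + 1}.$$

On substituting for $\hat\theta$ and $\hat\theta^*$ in (10.11), we get

$$E(\theta \mid x) \approx \left(\frac{n + 1}{n + 2}\right)^{n + (1/2)} \frac{e}{n + 1}\sum_{i=1}^{n} x_i.$$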
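
The approximation can be checked numerically. Under the prior $g(\theta) \propto 1/\theta^2$ the exact posterior is inverted gamma with shape $n + 1$ and scale $\sum x_i$, a standard conjugate result, so the exact posterior mean is $\sum x_i / n$. The sketch below evaluates (10.11) directly from $nL$ and $nL^*$ and compares it with both the closed form above and the exact mean; the data and function names are hypothetical.

```python
import math


def tk_posterior_mean_exponential(xs):
    """Tierney-Kadane approximation (10.11) to E(theta | x), prior 1/theta^2."""
    n = len(xs)
    s = sum(xs)

    def n_l(theta):        # nL(theta) = log likelihood + log prior
        return -(n + 2) * math.log(theta) - s / theta

    def n_l_star(theta):   # nL*(theta) = nL(theta) + log h(theta), h(theta) = theta
        return -(n + 1) * math.log(theta) - s / theta

    theta_hat = s / (n + 2)    # maximiser of nL
    theta_star = s / (n + 1)   # maximiser of nL*
    sigma = theta_hat / math.sqrt(n + 2)
    sigma_star = theta_star / math.sqrt(n + 1)
    return (sigma_star / sigma) * math.exp(n_l_star(theta_star) - n_l(theta_hat))


# Hypothetical lifetimes.
xs = [0.5, 1.5, 1.0, 0.8, 1.2, 0.9, 1.1, 1.3, 0.7, 1.0]
n, s = len(xs), sum(xs)

tk = tk_posterior_mean_exponential(xs)
closed_form = ((n + 1) / (n + 2)) ** (n + 0.5) * math.e * s / (n + 1)
exact = s / n  # mean of the inverted gamma posterior (assumed standard result)
print(tk, closed_form, exact)
```

For this sample of size 10 the approximation is within about one percent of the exact posterior mean.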


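Example 10.10. (Example 10.9 continued) Let us consider estimation of the reliability function

$$R(t) = P(X > t) = \int_{t}^{\infty} \frac{1}{\theta} e^{-x/\theta}\, dx = e^{-t/\theta}.$$

The Tierney-Kadane approximation of $E(e^{-t/\theta} \mid x)$ is

$$E(e^{-t/\theta} \mid x) \approx \frac{\hat\sigma^*}{\hat\sigma}\exp\left\{n\left(L^*(\hat\theta^*) - L(\hat\theta)\right)\right\} = \left(\frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i + t}\right)^{n+1},$$

since $nL^*(\theta) = -(n + 2)\log\theta - \left(\sum_{i=1}^{n} x_i + t\right) \big/ \theta$, so that

$$\hat\theta^* = \frac{\sum_{i=1}^{n} x_i + t}{n + 2} \quad \text{and} \quad \hat\sigma^{*2} = \frac{\hat\theta^{*2}}{n + 2}.$$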


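Remark 10.13. The procedures developed by Lindley (1980) and Tierney-Kadane (1986) for approximate evaluation of a ratio of integrals may also be employed for obtaining marginal posterior densities. Suppose the parameter $\theta$ is vector valued, say $\theta = (\theta_1, \theta_2)$, having $g(\theta_1, \theta_2 \mid x)$ as the posterior density. The marginal posterior density of $\theta_1$, given x, is

$$g(\theta_1 \mid x) = \int g(\theta_1, \theta_2 \mid x)\, d\theta_2 = \frac{\int \ell(\theta_1, \theta_2 \mid x)\, g(\theta_1, \theta_2)\, d\theta_2}{\int\int \ell(\theta_1, \theta_2 \mid x)\, g(\theta_1, \theta_2)\, d\theta_1\, d\theta_2},$$

which is again a ratio of integrals.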


10.3. BAYES INFORMATION CRITERION (SCHWARZ CRITERION) 


Bayes information criterion (BIC) is often used as a substitute for full calculation of the Bayes factor since it can be calculated without specifying prior distributions. BIC is also known as the Schwarz criterion and is given by

$$BIC = \log \ell(\hat\theta \mid x) - \frac{k}{2}\log n, \qquad (10.12)$$

where $\ell(\hat\theta \mid x)$ is the maximised likelihood value, k is the number of explanatory variables in the model including the constant, and n is the sample size.


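In order to prove the Schwarz criterion, suppose the observations $X = (X_1, X_2, \ldots, X_n)$ are drawn from a population with pdf $f(x \mid \theta)$. Let the prior distribution of $\theta$ be $g(\theta)$. Let us assume that $\hat\theta$ is the posterior mode. Then writing $h(\theta) = \log\left(f(x \mid \theta)\, g(\theta)\right)$ and expanding $h(\theta)$ by Taylor's expansion about the point $\theta = \hat\theta$, we have

$$h(\theta) \approx h(\hat\theta) + \frac{1}{2}(\theta - \hat\theta)'\, h''(\hat\theta)\,(\theta - \hat\theta). \qquad (10.13)$$

Note that $\theta$ is a k-vector parameter.

Thus

$$m(x) = \int \ell(\theta \mid x)\, g(\theta)\, d\theta \approx \int \exp\left(h(\hat\theta)\right)\exp\left\{\frac{1}{2}(\theta - \hat\theta)'\, h''(\hat\theta)\,(\theta - \hat\theta)\right\} d\theta$$

$$= \exp\left(h(\hat\theta)\right)\int \exp\left\{\frac{1}{2}(\theta - \hat\theta)'\, h''(\hat\theta)\,(\theta - \hat\theta)\right\} d\theta.$$

Since the integrand is proportional to the multivariate normal density with precision matrix $A = -h''(\hat\theta)$, the marginal density of X is

$$m(x) \approx \exp\left(h(\hat\theta)\right)(2\pi)^{k/2}\,|A|^{-1/2}. \qquad (10.14)$$

For large n, the posterior mode is the same as the maximum likelihood estimate (under some regularity conditions) and $A = nI$, where I is the expected information matrix for a single observation; therefore $|A| = n^k |I|$. Thus, we have

$$\log m(x) \approx h(\hat\theta) + \frac{k}{2}\log 2\pi - \frac{1}{2}\log|A|$$

$$= \log f(x \mid \hat\theta) + \log g(\hat\theta) + \frac{k}{2}\log 2\pi - \frac{k}{2}\log n - \frac{1}{2}\log|I|.$$

Since $g(\theta)$ is MVN with mean $\hat\theta$ and precision matrix I (this is so if the prior is equivalent to a single extra observation), we have $\log g(\hat\theta) = \frac{1}{2}\log|I| - \frac{k}{2}\log 2\pi$, since the exponential term at $\theta = \hat\theta$ is one. Thus, we have

$$\log m(x) \approx \log \ell(\hat\theta \mid x) + \frac{1}{2}\log|I| - \frac{k}{2}\log 2\pi + \frac{k}{2}\log 2\pi - \frac{k}{2}\log n - \frac{1}{2}\log|I|$$

$$= \log \ell(\hat\theta \mid x) - \frac{k}{2}\log n. \qquad (10.15)$$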


This quantity is known as the BIC. It penalises models which improve fit but at the cost of introduction of more parameters in the model and, therefore, serves as a measure of simplicity of the model.

Example 10.11. (Leonard & Hsu, 1999) Suppose $X_1, X_2, \ldots, X_n$ is a random sample and we have two competing models which may fit the data. The model $M_1$ is Gamma$(\alpha, \beta)$, where $\alpha$ is known but $\beta$ is unknown, whereas model $M_2$ is Lognormal$(\theta, \sigma^2)$, where both $\theta$ and $\sigma^2$ are unknown. In order to compute BIC for model $M_1$, we note that

$$\log \ell(\beta \mid x) = \log \prod_{i=1}^{n} \frac{\beta^{\alpha} x_i^{\alpha - 1} e^{-\beta x_i}}{\Gamma(\alpha)} = n\alpha\log\beta - n\beta\bar x - n\log\Gamma(\alpha) + (\alpha - 1)\sum_{i=1}^{n}\log x_i,$$

and the maximum likelihood estimate of $\beta$ is $\hat\beta = \alpha/\bar x$. Therefore, BIC for model $M_1$ is

$$B_1 = \log \ell(\hat\beta \mid x) - \frac{1}{2}\log n$$

$$= n\alpha\log\left(\frac{\alpha}{\bar x}\right) - n\left(\frac{\alpha}{\bar x}\right)\bar x - n\log\Gamma(\alpha) + (\alpha - 1)\sum_{i=1}^{n}\log x_i - \frac{1}{2}\log n$$

$$= n\alpha\log\left(\frac{\alpha}{\bar x}\right) - n\left(\alpha + \log\Gamma(\alpha)\right) + (\alpha - 1)\sum_{i=1}^{n}\log x_i - \frac{1}{2}\log n.$$


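For model $M_2$, k is 2 and

$$\log \ell(\theta, \sigma^2 \mid x) = \log \prod_{i=1}^{n} \frac{1}{x_i\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(\log x_i - \theta)^2}{2\sigma^2}\right\}$$

$$= -\frac{n}{2}\log 2\pi\sigma^2 - \sum_{i=1}^{n}\log x_i - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(\log x_i - \theta)^2.$$

Therefore, the maximum likelihood estimate of $\theta$ is $\hat\theta = \frac{1}{n}\sum_{i=1}^{n}\log x_i$ and that of $\sigma^2$ is $\hat\sigma^2 = \sum_{i=1}^{n}(\log x_i - \hat\theta)^2 / n$. Thus, BIC for model $M_2$ is

$$B_2 = \log \ell(\hat\theta, \hat\sigma^2 \mid x) - \frac{2}{2}\log n$$

$$= -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\hat\sigma^2 - \sum_{i=1}^{n}\log x_i - \frac{1}{2\hat\sigma^2}\sum_{i=1}^{n}(\log x_i - \hat\theta)^2 - \log n$$

$$= -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\hat\sigma^2 - \sum_{i=1}^{n}\log x_i - \frac{n}{2} - \log n.$$

The Schwarz criterion chooses model $M_1$ if $B_1 > B_2$.

Remark 10.14. We may also use BIC to compare two models $M_1$ and $M_2$ having marginal densities $f(x \mid M_1)$ and $f(x \mid M_2)$. The Bayes factor in favour of model $M_1$ is defined as

$$B_{12} = \frac{f(x \mid M_1)}{f(x \mid M_2)} = \frac{m_1(x)}{m_2(x)},$$

where $m_i(x) = f(x \mid M_i)$, $i = 1, 2$.

Using (10.15),

$$\log B_{12} = \log m_1(x) - \log m_2(x)$$

$$\approx \log \ell_1(\hat\theta_1 \mid x) - \log \ell_2(\hat\theta_2 \mid x) - \left(\frac{k_1 - k_2}{2}\right)\log n, \qquad (10.16)$$

where model $M_i$ has parameter $\theta_i$ of dimension $k_i$, $i = 1, 2$.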
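
The two BIC values of Example 10.11 and their difference, which approximates $\log B_{12}$ in (10.16), can be computed in a few lines. This is an illustrative sketch only: the data and the assumed known shape `alpha` are made up, and the function names are hypothetical.

```python
import math


def bic_gamma_known_shape(xs, alpha):
    """B_1: BIC of the Gamma(alpha, beta) model with alpha known (k = 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    sum_log = sum(math.log(x) for x in xs)
    loglik = (n * alpha * math.log(alpha / xbar) - n * alpha
              - n * math.lgamma(alpha) + (alpha - 1.0) * sum_log)
    return loglik - 0.5 * math.log(n)


def bic_lognormal(xs):
    """B_2: BIC of the Lognormal(theta, sigma^2) model (k = 2)."""
    n = len(xs)
    logs = [math.log(x) for x in xs]
    theta_hat = sum(logs) / n
    sigma2_hat = sum((v - theta_hat) ** 2 for v in logs) / n
    loglik = (-0.5 * n * math.log(2.0 * math.pi) - 0.5 * n * math.log(sigma2_hat)
              - sum(logs) - 0.5 * n)
    return loglik - math.log(n)


# Hypothetical positive observations and an assumed known shape.
xs = [1.2, 0.7, 2.5, 1.9, 0.4, 3.1, 1.1, 0.9]
alpha = 2.0

b1 = bic_gamma_known_shape(xs, alpha)
b2 = bic_lognormal(xs)
approx_log_bayes_factor = b1 - b2  # approximates log B_12
print(b1, b2, approx_log_bayes_factor)
```

A positive difference favours the gamma model $M_1$ and a negative one the lognormal model $M_2$.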
Remark 10.15. Note that the difference of BIC for model $M_1$ and model $M_2$ is just an approximation to $\log B_{12}$ and the relative error is of $O(1)$.

Remark 10.16. If we write $\lambda = f(x \mid \hat\theta_1, M_1) / f(x \mid \hat\theta_2, M_2)$, then for $M_1 \subset M_2$, $-2\log\lambda \sim \chi^2_{k_2 - k_1}$ under the hypothesis that $M_1$ is the true model.

Remark 10.17. The quantity on the right hand side of (10.16) is often called the Bayesian information criterion (BIC); some workers define it with an added arbitrary constant (see Kass and Raftery, 1995).
Remark 10.18. Another commonly used measure of goodness of fit is Akaike's information criterion (AIC), given by

$$AIC = \log \ell(\hat\theta \mid x) - k. \qquad (10.17)$$

The AIC picks the model that gives the best approximation, asymptotically, in the Kullback-Leibler sense. Some authors feel that the AIC has a strong bias toward models that overfit with extra parameters.

In general, an information criterion may be defined as $\log \ell(\hat\theta \mid x) - \alpha k / 2$. For $\alpha = 2$, we get Akaike's information criterion and for $\alpha = \log n$ (or $\log(n/2\pi)$), we get the Schwarz criterion.


Chapter 11 


Other Topics 


In many data analyses, theoretical and empirical considerations lead to statistical models that account for a changing distribution of the random variables. Problems of changing models are found in a wide variety of disciplines, including econometrics and biology. In a short description, given in Section 11.1, we provide methodology to deal with estimation and detection of a possible change in the underlying model and also indicate how to predict a future observation when a change has taken place.

In the next two sections, we shall briefly discuss the procedure to deal with unknown 
hyperparameters of the conjugate prior distribution, namely, empirical Bayes and hierarchical Bayes 
estimation procedures. 

Another problem of interest to a Bayesian statistician is concerning effects of misspecification 
of the inputs for drawing inference/decision. Such a study of sensitivity of an inference or a decision 
to a change in one or more of the assumptions concerning the model, prior, and the loss function is 
known as robustness of the inferential procedure. A statistical investigator uses robustness studies 
to develop confidence in the mind of the client, since the specification of the model and the loss 
function is only an approximation to the true situation. 


11.1 CHANGE POINT MODEL 


Change point models are used to describe discontinuous behaviour in a stochastic phenomenon. 
The shift point indexes where or when the shift occurs. It is a discrete parameter treated as a random 
variable. In Bayesian set up, the prior distribution of the shift point gives the nature of the change to 
be expected. There are two fundamental problems of interest. 
(a) Has a change taken place among the variables under investigation during the observation period?
This problem is called the detection problem because our interest is to find a change in the relationship. 
This problem can be solved using Bayes factor. 
(b) Assuming that a change has occurred, can one estimate the parameters of the model? For 
example, one may wish to estimate the place or time of the change as well as the pre- and post-change 
parameters, those that explain the before and after relationships between the models. 


Mathematical formulation of the change point problem 


(a) Estimation 


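Let $X_1, X_2, \ldots, X_m, X_{m+1}, \ldots, X_n$ be a sequence of independent random variables such that

$$X_i \sim \begin{cases} f_1(x \mid \theta_1), & i = 1, 2, \ldots, m \\ f_2(x \mid \theta_2), & i = m+1, \ldots, n, \end{cases}$$

with a change point at m, where m is unknown. If $g(\theta_1, \theta_2, m)$ is a joint prior distribution of $\theta_1$, $\theta_2$, and m, then the joint posterior distribution is

$$g(\theta_1, \theta_2, m \mid x) \propto \ell(\theta_1, \theta_2, m \mid x)\, g(\theta_1, \theta_2, m), \quad \theta_1 \in \Theta_1,\ \theta_2 \in \Theta_2,\ m = 1, 2, \ldots, n-1,$$

where

$$\ell(\theta_1, \theta_2, m \mid x) = \prod_{i=1}^{m} f_1(x_i \mid \theta_1)\prod_{i=m+1}^{n} f_2(x_i \mid \theta_2).$$

The marginal posterior distributions are

$$g(m \mid x) = \int_{\Theta_1}\int_{\Theta_2} g(\theta_1, \theta_2, m \mid x)\, d\theta_1\, d\theta_2, \quad m = 1, 2, \ldots, n-1,$$

$$g(\theta_1 \mid x) = \sum_{m=1}^{n-1}\int_{\Theta_2} g(\theta_1, \theta_2, m \mid x)\, d\theta_2, \quad \theta_1 \in \Theta_1, \qquad (11.1)$$

$$g(\theta_2 \mid x) = \sum_{m=1}^{n-1}\int_{\Theta_1} g(\theta_1, \theta_2, m \mid x)\, d\theta_1, \quad \theta_2 \in \Theta_2.$$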

We may use these marginal posterior distributions to obtain Bayes estimates of the parameters of 
interest. 

Remark 11.1. The literature on change point problems using the Bayesian approach employs a discrete uniform prior for the change point m over its entire range, which is also assumed to be independent of $\theta_1$ and $\theta_2$. The Bayesian approach is flexible enough to entertain any probabilistic prior information about the change point. One may, therefore, use a partially informative or a non-informative prior for the change point.

(b) Detection of a change 

This is basically a problem of comparing two hypotheses, namely, there is no change versus there 
is exactly one change during the period of observation. In order to entertain the possibility of no change 
in the sequence, we should include m = n in the range of possible values of the change point m. We 
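frame the null hypothesis $H_0 : m = n$ against the alternative $H_1 : m \in \{1, \ldots, n-1\}$.

Let us assume, as before, that the parameters $\theta_1$ and $\theta_2$ are a priori independent of the change
point $m$. Assume that the marginal pmf of $m$ is

$$g(m) = \begin{cases} p & \text{if } m = n \\ \dfrac{1-p}{n-1} & \text{if } m = 1, 2, \ldots, n-1, \end{cases} \qquad 0 < p < 1. \tag{11.2}$$

Note that under $H_0$, there is no change in the model and, therefore, the joint prior distribution
of $(\theta_1, \theta_2, m)$ is

$$g(\theta_1, \theta_2, m) = g(\theta_1)\, g(m) \tag{11.3}$$

since the parameter $\theta_2$ does not appear in the model. However, under $H_1$, the joint prior distribution
of $(\theta_1, \theta_2)$ is assumed to be $g(\theta_1, \theta_2)$. Thus the likelihood function of $\theta_1$, $\theta_2$, and $m$ is

$$\ell(\theta_1, \theta_2, m \mid x) = \begin{cases} \prod_{i=1}^{n} f_1(x_i \mid \theta_1) & \text{if } m = n \\ \prod_{i=1}^{m} f_1(x_i \mid \theta_1) \prod_{i=m+1}^{n} f_2(x_i \mid \theta_2) & \text{if } m = 1, 2, \ldots, n-1, \end{cases}$$

and the posterior distribution is

$$g(\theta_1, \theta_2, m \mid x) \propto \begin{cases} p \prod_{i=1}^{n} f_1(x_i \mid \theta_1)\, g(\theta_1) & \text{if } m = n \\ \dfrac{1-p}{n-1} \prod_{i=1}^{m} f_1(x_i \mid \theta_1) \prod_{i=m+1}^{n} f_2(x_i \mid \theta_2)\, g(\theta_1, \theta_2) & \text{if } m = 1, 2, \ldots, n-1. \end{cases} \tag{11.4}$$

The marginal posterior pmf of $m$ is given by

$$g(m \mid x) \propto \begin{cases} p \int_{\Theta_1} \prod_{i=1}^{n} f_1(x_i \mid \theta_1)\, g(\theta_1)\, d\theta_1 & \text{if } m = n \\ \dfrac{1-p}{n-1} \int_{\Theta_1} \int_{\Theta_2} \prod_{i=1}^{m} f_1(x_i \mid \theta_1) \prod_{i=m+1}^{n} f_2(x_i \mid \theta_2)\, g(\theta_1, \theta_2)\, d\theta_2\, d\theta_1 & \text{if } m = 1, 2, \ldots, n-1. \end{cases} \tag{11.5}$$

The normalising constant is

$$D(x) = p \int_{\Theta_1} \prod_{i=1}^{n} f_1(x_i \mid \theta_1)\, g(\theta_1)\, d\theta_1 + \frac{1-p}{n-1} \sum_{m=1}^{n-1} \int_{\Theta_1} \int_{\Theta_2} \prod_{i=1}^{m} f_1(x_i \mid \theta_1) \prod_{i=m+1}^{n} f_2(x_i \mid \theta_2)\, g(\theta_1, \theta_2)\, d\theta_2\, d\theta_1. \tag{11.6}$$

The posterior odds in favour of the hypothesis of no change are

$$O(H_0 \mid x) = \frac{P(m = n \mid x)}{1 - P(m = n \mid x)} = \frac{p \int_{\Theta_1} \prod_{i=1}^{n} f_1(x_i \mid \theta_1)\, g(\theta_1)\, d\theta_1}{\dfrac{1-p}{n-1} \sum_{m=1}^{n-1} \int_{\Theta_1} \int_{\Theta_2} \prod_{i=1}^{m} f_1(x_i \mid \theta_1) \prod_{i=m+1}^{n} f_2(x_i \mid \theta_2)\, g(\theta_1, \theta_2)\, d\theta_2\, d\theta_1}. \tag{11.7}$$

We may, therefore, infer that $H_0$ is less likely than $H_1$ if $O(H_0 \mid x) < 1$.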


Remark 11.2. Sometimes the value of $p$ is also not known. We may then consider $p$ to be a priori
distributed like Beta($a$, $b$). The marginal prior distribution $g(m)$ of $m$ becomes

$$g(m) = \int_0^1 g(m \mid p)\, g(p)\, dp = \begin{cases} \displaystyle\int_0^1 p\, \frac{p^{a-1}(1-p)^{b-1}}{B(a, b)}\, dp & \text{if } m = n \\ \displaystyle\int_0^1 \frac{1-p}{n-1}\, \frac{p^{a-1}(1-p)^{b-1}}{B(a, b)}\, dp & \text{if } m = 1, 2, \ldots, n-1 \end{cases}$$

$$= \begin{cases} \dfrac{a}{a+b} & \text{if } m = n \\ \dfrac{b}{(a+b)(n-1)} & \text{if } m = 1, 2, \ldots, n-1. \end{cases} \tag{11.8}$$

In case we wish to use diffuse prior information about the distribution of $p$, we may use the Bayes-Laplace
uniform prior for $p$, in which case $a = b = 1$, and we have

$$g(m) = \begin{cases} 1/2 & \text{if } m = n \\ 1/\{2(n-1)\} & \text{if } m \neq n. \end{cases} \tag{11.9}$$

Remark 11.3. If the investigator is interested in introducing the consequences of a wrong decision, he may
employ an appropriate loss function to test $H_0$ against $H_1$.

Remark 11.4. Smith (1980) considered the problem of change point as that of model comparison. Let
us write a model $M_m$ of change point at $m$ as

$$f(x_1, x_2, \ldots, x_n \mid M_m) = \int_{\Theta_1} \int_{\Theta_2} \prod_{i=1}^{m} f_1(x_i \mid \theta_1) \prod_{i=m+1}^{n} f_2(x_i \mid \theta_2)\, g(\theta_1, \theta_2)\, d\theta_2\, d\theta_1.$$

If $M_n$ is the model which assumes no change, we have

$$f(x_1, x_2, \ldots, x_n \mid M_n) = \int_{\Theta_1} \prod_{i=1}^{n} f_1(x_i \mid \theta_1)\, g(\theta_1)\, d\theta_1.$$

This formulation reduces the set of alternative models to $M_1, M_2, \ldots, M_n$. We may now compare the
models pairwise using Bayes factors

$$B_{ij} = \frac{f(x_1, x_2, \ldots, x_n \mid M_i)}{f(x_1, x_2, \ldots, x_n \mid M_j)}. \tag{11.10}$$

The Bayes factor measures the contribution of the observed sample when the effect of the prior is partially
eliminated.


(c) Prediction of future observations 
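Assume that a change has taken place at some unknown time point (or place) $m$. Let us denote
$x^{(1)} = (x_1, x_2, \ldots, x_m)$ and $x^{(2)} = (x_{m+1}, x_{m+2}, \ldots, x_n)$ so that $x = (x^{(1)}, x^{(2)})$. Then

$$g(\theta_2 \mid m, x^{(2)}) \propto \prod_{i=m+1}^{n} f_2(x_i \mid \theta_2)\, g(\theta_2). \tag{11.11}$$

If $y = x_{n+1}$ is the $(n+1)$th independent observation from $f_2(y \mid \theta_2)$, then

$$g(y \mid m, x) = \int_{\Theta_2} g(\theta_2 \mid m, x^{(2)})\, f_2(y \mid \theta_2)\, d\theta_2, \tag{11.12}$$

and the predictive distribution of $y$, given $x$, is

$$g(y \mid x) = \sum_{m=1}^{n-1} \int_{\Theta_1} \int_{\Theta_2} g(\theta_1, \theta_2, m \mid x)\, f_2(y \mid \theta_2)\, d\theta_2\, d\theta_1$$

$$= \sum_{m=1}^{n-1} \int_{\Theta_1} \int_{\Theta_2} g(m \mid x)\, g(\theta_1 \mid m, x^{(1)})\, g(\theta_2 \mid m, x^{(2)})\, f_2(y \mid \theta_2)\, d\theta_2\, d\theta_1$$

$$= \sum_{m=1}^{n-1} g(m \mid x) \int_{\Theta_1} g(\theta_1 \mid m, x^{(1)})\, d\theta_1 \int_{\Theta_2} g(\theta_2 \mid m, x^{(2)})\, f_2(y \mid \theta_2)\, d\theta_2$$

$$= \sum_{m=1}^{n-1} g(m \mid x)\, g(y \mid m, x). \tag{11.13}$$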
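Example 11.1. Suppose that $X_1, X_2, \ldots, X_n$ are independent normal random variables with
$X_i \sim N(\theta_1, 1)$ if $i = 1, 2, \ldots, m$ and $X_i \sim N(\theta_2, 1)$ if $i = m+1, \ldots, n$, where $1 \le m \le n-1$ and
$m$, $\theta_1$, $\theta_2$ are unknown parameters with $\theta_1 \neq \theta_2$. Note that the sequence is changing in the mean.

Let us consider the case in which we know that a change has occurred but do not know when.
Our interest is to estimate the change point $m$. The likelihood function is

$$\ell(\theta_1, \theta_2, m \mid x) \propto \exp\left[-\frac{1}{2}\left\{\sum_{i=1}^{m} (x_i - \theta_1)^2 + \sum_{i=m+1}^{n} (x_i - \theta_2)^2\right\}\right],$$

where $\theta_1, \theta_2 \in (-\infty, \infty)$, $m = 1, 2, \ldots, n-1$, and $x = (x_1, x_2, \ldots, x_n)$. Let us consider that a priori $\theta_1$, $\theta_2$ and $m$
are independent of each other and that they are also non-informative so that $g(\theta_1, \theta_2) \propto 1$; $\theta_1$,
$\theta_2 \in (-\infty, \infty)$ and $g(m) = 1/(n-1)$, $m = 1, 2, \ldots, n-1$. The joint posterior distribution of $\theta_1$, $\theta_2$, and $m$ is

$$g(\theta_1, \theta_2, m \mid x) \propto \exp\left[-\frac{1}{2}\left\{\sum_{i=1}^{m} (x_i - \theta_1)^2 + \sum_{i=m+1}^{n} (x_i - \theta_2)^2\right\}\right],$$

where the normalising constant (constant factors common to all $m$ cancel) is

$$D(x) = \sum_{m=1}^{n-1} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp\left[-\frac{1}{2}\left\{\sum_{i=1}^{m} (x_i - \theta_1)^2 + \sum_{i=m+1}^{n} (x_i - \theta_2)^2\right\}\right] d\theta_1\, d\theta_2$$

$$= \sum_{m=1}^{n-1} \exp\left[-\frac{1}{2}(S_1 + S_2)\right] \int_{-\infty}^{\infty} \exp\left[-\frac{m}{2}(\theta_1 - \bar{x}_1)^2\right] d\theta_1 \int_{-\infty}^{\infty} \exp\left[-\frac{n-m}{2}(\theta_2 - \bar{x}_2)^2\right] d\theta_2$$

$$= \sum_{m=1}^{n-1} \exp\left[-\frac{1}{2}(S_1 + S_2)\right] 2\pi\, (m(n-m))^{-1/2},$$

where

$$\bar{x}_1 = \sum_{i=1}^{m} x_i / m, \quad \bar{x}_2 = \sum_{i=m+1}^{n} x_i / (n-m), \quad S_1 = \sum_{i=1}^{m} (x_i - \bar{x}_1)^2, \quad \text{and} \quad S_2 = \sum_{i=m+1}^{n} (x_i - \bar{x}_2)^2.$$

Thus,

$$g(\theta_1, \theta_2, m \mid x) = \frac{\exp[-(S_1 + S_2)/2]\, \exp[-m(\theta_1 - \bar{x}_1)^2/2]\, \exp[-(n-m)(\theta_2 - \bar{x}_2)^2/2]}{2\pi \sum_{m=1}^{n-1} \exp[-(S_1 + S_2)/2]\, (m(n-m))^{-1/2}}.$$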


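The marginal posterior distributions of $\theta_1$, $\theta_2$ and $m$ are

$$g(\theta_1 \mid x) = \sum_{m=1}^{n-1} \int_{-\infty}^{\infty} g(\theta_1, \theta_2, m \mid x)\, d\theta_2 = \frac{\sum_{m=1}^{n-1} \exp[-(S_1 + S_2)/2]\, (m(n-m))^{-1/2} \sqrt{\dfrac{m}{2\pi}}\, \exp[-m(\theta_1 - \bar{x}_1)^2/2]}{\sum_{m=1}^{n-1} \exp[-(S_1 + S_2)/2]\, (m(n-m))^{-1/2}}, \quad \theta_1 \in (-\infty, \infty),$$

$$g(\theta_2 \mid x) = \sum_{m=1}^{n-1} \int_{-\infty}^{\infty} g(\theta_1, \theta_2, m \mid x)\, d\theta_1 = \frac{\sum_{m=1}^{n-1} \exp[-(S_1 + S_2)/2]\, (m(n-m))^{-1/2} \sqrt{\dfrac{n-m}{2\pi}}\, \exp[-(n-m)(\theta_2 - \bar{x}_2)^2/2]}{\sum_{m=1}^{n-1} \exp[-(S_1 + S_2)/2]\, (m(n-m))^{-1/2}}, \quad \theta_2 \in (-\infty, \infty),$$

and

$$g(m \mid x) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(\theta_1, \theta_2, m \mid x)\, d\theta_1\, d\theta_2 = \frac{(m(n-m))^{-1/2} \exp[-(S_1 + S_2)/2]}{\sum_{m=1}^{n-1} (m(n-m))^{-1/2} \exp[-(S_1 + S_2)/2]}, \quad m = 1, 2, \ldots, n-1.$$

The predictive density of the future independent observation $y$ from $N(\theta_2, 1)$ is given by

$$g(y \mid x) = \sum_{m=1}^{n-1} g(m \mid x)\, g(y \mid m, x),$$

where

$$g(y \mid m, x) = \int_{-\infty}^{\infty} g(\theta_2 \mid m, x^{(2)})\, f_2(y \mid \theta_2)\, d\theta_2.$$

Since the posterior distribution $g(\theta_2 \mid m, x^{(2)})$ is $N(\bar{x}_2, (n-m)^{-1})$ and the predictive distribution
$g(y \mid m, x)$ is $N(\bar{x}_2, 1 + (n-m)^{-1})$, we have

$$g(y \mid x) = \frac{\sum_{m=1}^{n-1} (m(n-m))^{-1/2} \exp[-(S_1 + S_2)/2] \sqrt{\dfrac{n-m}{2\pi(n-m+1)}}\, \exp\left[-\dfrac{(n-m)(y - \bar{x}_2)^2}{2(n-m+1)}\right]}{\sum_{m=1}^{n-1} (m(n-m))^{-1/2} \exp[-(S_1 + S_2)/2]}.$$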
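The posterior of the change point in Example 11.1 is easy to evaluate numerically. The following is a minimal Python sketch, assuming the unit-variance normal model with flat priors used in the example; the function names and the data are hypothetical and chosen only for illustration. Weights are handled on the log scale to avoid underflow.

```python
import math

def change_point_posterior(x):
    """Return [g(m=1|x), ..., g(m=n-1|x)] for the unit-variance normal
    mean-shift model with flat priors on theta1, theta2 and uniform prior on m."""
    n = len(x)
    log_w = []
    for m in range(1, n):
        left, right = x[:m], x[m:]
        mean1 = sum(left) / m
        mean2 = sum(right) / (n - m)
        s1 = sum((v - mean1) ** 2 for v in left)
        s2 = sum((v - mean2) ** 2 for v in right)
        # log of the weight (m (n - m))^(-1/2) * exp(-(S1 + S2) / 2)
        log_w.append(-0.5 * math.log(m * (n - m)) - 0.5 * (s1 + s2))
    shift = max(log_w)                       # subtract the maximum before exponentiating
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def posterior_means(x):
    """Posterior means of theta1 and theta2 as mixtures of segment means."""
    n = len(x)
    post = change_point_posterior(x)
    e_theta1 = sum(p * sum(x[:m]) / m for m, p in zip(range(1, n), post))
    e_theta2 = sum(p * sum(x[m:]) / (n - m) for m, p in zip(range(1, n), post))
    return e_theta1, e_theta2

# Made-up data: the mean shifts from about 0 to about 3 after the 5th observation.
x = [0.1, -0.3, 0.2, 0.0, -0.1, 3.1, 2.9, 3.2, 2.8, 3.0]
post = change_point_posterior(x)
m_hat = 1 + max(range(len(post)), key=post.__getitem__)   # posterior mode of m
e_theta1, e_theta2 = posterior_means(x)
print(m_hat, round(e_theta1, 3), round(e_theta2, 3))
```

The posterior mode of $m$ serves here as the point estimate of the change point; the posterior means of $\theta_1$ and $\theta_2$ average the segment means over the posterior of $m$.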


Remark 11.5. We notice that the marginal posterior distributions of $\theta_1$ and $\theta_2$, as well as the predictive
distribution of $y$, given $x$, are finite mixtures of normal densities with weights

$$w_m = (m(n-m))^{-1/2} \exp[-(S_1 + S_2)/2], \quad m = 1, \ldots, n-1.$$

Remark 11.6. If we are interested in predicting the future mean $\bar{y}$ of $k$ successive observations
$y_1, y_2, \ldots, y_k$, then in the above expressions we have to use $f_2(\bar{y} \mid \theta_2)$ in place of $f_2(y \mid \theta_2)$. Since
$f_2(\bar{y} \mid \theta_2)$ is $N(\theta_2, 1/k)$, we have

$$g(\bar{y} \mid x) = \frac{\sum_{m=1}^{n-1} (m(n-m))^{-1/2} \exp[-(S_1 + S_2)/2] \sqrt{\dfrac{(n-m)k}{2\pi(n-m+k)}}\, \exp\left[-\dfrac{(n-m)k(\bar{y} - \bar{x}_2)^2}{2(n-m+k)}\right]}{\sum_{m=1}^{n-1} (m(n-m))^{-1/2} \exp[-(S_1 + S_2)/2]}.$$


Remark 11.7. In case $X_1, X_2, \ldots, X_n$ is a sequence of independent random variables such that the first
$m$ observations are from $N(\theta_1, r)$ and the last $(n-m)$ observations are from $N(\theta_2, r)$, where the precision
$r$ is also unknown, we take the joint prior distribution of $\theta_1$, $\theta_2$, $r$ such that

$$g(\theta_1, \theta_2, r) = g_1(\theta_1, \theta_2 \mid r)\, g_2(r); \quad \theta_1, \theta_2 \in (-\infty, \infty) \text{ and } r > 0,$$

where $g_1(\theta_1, \theta_2 \mid r)$ is a bivariate normal distribution and $g_2(r)$ is a gamma distribution. We shall see that
the marginal posterior distributions of $\theta_1$ and $\theta_2$ and the predictive distribution of $y$ are mixtures of
t-densities. If we wish to construct the HPD interval for $\theta_1$, $\theta_2$, or the future observation, we may have
to use a numerical integration technique, since all these distributions are finite mixtures of the respective
distributions.

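Example 11.2. (Example 11.1 continued) In order to decide whether the sequence changed at most once,
we shall test the hypothesis $H_0 : m = n$ against $H_1 : m = 1, 2, \ldots, n-1$. The likelihood function is given
by

$$\ell(\theta_1, \theta_2, m \mid x) = \begin{cases} (2\pi)^{-n/2} \exp\left[-\left(S + n(\bar{x} - \theta_1)^2\right)/2\right] & \text{if } m = n \\ (2\pi)^{-n/2} \exp\left[-\left(S_1 + S_2 + m(\bar{x}_1 - \theta_1)^2 + (n-m)(\bar{x}_2 - \theta_2)^2\right)/2\right] & \text{if } m = 1, \ldots, n-1, \end{cases}$$

where $\bar{x} = \sum_{i=1}^{n} x_i / n$ and $S = \sum_{i=1}^{n} (x_i - \bar{x})^2$.

Suppose $\theta_1$, $\theta_2$, $m$ are a priori independent such that $g(\theta_1, \theta_2) \propto 1$; $\theta_1, \theta_2 \in (-\infty, \infty)$ and

$$g(m) = \begin{cases} p & \text{if } m = n \\ \dfrac{1-p}{n-1} & \text{if } m = 1, 2, \ldots, n-1, \end{cases}$$

where $p$ is some known constant between 0 and 1. The posterior distribution of $\theta_1$, $\theta_2$, $m$ is

$$g(\theta_1, \theta_2, m \mid x) \propto \begin{cases} p\,(2\pi)^{-n/2} \exp\left[-\left(S + n(\theta_1 - \bar{x})^2\right)/2\right] & \text{if } m = n \\ \dfrac{1-p}{n-1}\,(2\pi)^{-n/2} \exp\left[-\left(S_1 + S_2 + m(\theta_1 - \bar{x}_1)^2 + (n-m)(\theta_2 - \bar{x}_2)^2\right)/2\right] & \text{if } m = 1, \ldots, n-1, \end{cases}$$

where the normalising constant is

$$D_1(x) = p\,(2\pi)^{-n/2} \int_{-\infty}^{\infty} \exp\left[-\left(S + n(\theta_1 - \bar{x})^2\right)/2\right] d\theta_1$$

$$\qquad + \frac{1-p}{n-1}\,(2\pi)^{-n/2} \sum_{m=1}^{n-1} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp\left[-\left(S_1 + S_2 + m(\theta_1 - \bar{x}_1)^2 + (n-m)(\theta_2 - \bar{x}_2)^2\right)/2\right] d\theta_1\, d\theta_2$$

$$= (2\pi)^{-n/2} \left[ p \exp\left(-\frac{S}{2}\right) \sqrt{\frac{2\pi}{n}} + \frac{1-p}{n-1} \sum_{m=1}^{n-1} \exp\left(-\frac{S_1 + S_2}{2}\right) \frac{2\pi}{\sqrt{m(n-m)}} \right].$$

The marginal posterior distribution of $m$ is given by

$$g(m = n \mid x) = \frac{p}{D_1(x)}\,(2\pi)^{-n/2} \int_{-\infty}^{\infty} \exp\left[-\frac{1}{2}\left(S + n(\theta_1 - \bar{x})^2\right)\right] d\theta_1 = \frac{p}{D_1(x)}\,(2\pi)^{-n/2} \exp\left(-\frac{S}{2}\right) \sqrt{\frac{2\pi}{n}},$$

and for $m \neq n$

$$g(m \mid x) = \frac{1}{D_1(x)} \left(\frac{1-p}{n-1}\right) (2\pi)^{-n/2} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp\left[-\frac{1}{2}\left(S_1 + S_2 + m(\theta_1 - \bar{x}_1)^2 + (n-m)(\theta_2 - \bar{x}_2)^2\right)\right] d\theta_1\, d\theta_2$$

$$= \frac{1}{D_1(x)} \left(\frac{1-p}{n-1}\right) (2\pi)^{-n/2} \exp\left(-\frac{S_1 + S_2}{2}\right) \frac{2\pi}{\sqrt{m(n-m)}}.$$

Thus, the posterior odds in favour of $H_0$ are

$$O(H_0 \mid x) = \frac{p \exp\left(-\dfrac{S}{2}\right) \sqrt{\dfrac{2\pi}{n}}}{\dfrac{1-p}{n-1} \sum_{m=1}^{n-1} \exp\left(-\dfrac{S_1 + S_2}{2}\right) \dfrac{2\pi}{\sqrt{m(n-m)}}} = \frac{p\,(n-1)}{1-p}\, \frac{\exp(-S/2)}{\sqrt{2\pi n}} \left[ \sum_{m=1}^{n-1} \frac{\exp\left(-(S_1 + S_2)/2\right)}{\sqrt{m(n-m)}} \right]^{-1}.$$

We shall, therefore, infer that there is no change in the sequence if $O(H_0 \mid x) > 1$.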
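The posterior odds of Example 11.2 can be evaluated directly from the closed form. The sketch below is illustrative only: the function name, the choice $p = 0.5$ and both data sets are assumptions made for the illustration, and the sums are accumulated on the log scale.

```python
import math

def posterior_odds_no_change(x, p=0.5):
    """O(H0 | x) for unit-variance normal data with flat priors on the means."""
    n = len(x)
    xbar = sum(x) / n
    s = sum((v - xbar) ** 2 for v in x)
    # log numerator: p * exp(-S/2) * sqrt(2*pi/n)
    log_num = math.log(p) - 0.5 * s + 0.5 * math.log(2 * math.pi / n)
    log_terms = []
    for m in range(1, n):
        left, right = x[:m], x[m:]
        mean1 = sum(left) / m
        mean2 = sum(right) / (n - m)
        s1 = sum((v - mean1) ** 2 for v in left)
        s2 = sum((v - mean2) ** 2 for v in right)
        # log of exp(-(S1+S2)/2) * 2*pi / sqrt(m (n-m))
        log_terms.append(-0.5 * (s1 + s2) + math.log(2 * math.pi)
                         - 0.5 * math.log(m * (n - m)))
    shift = max(log_terms)
    log_den = (math.log((1 - p) / (n - 1)) + shift
               + math.log(sum(math.exp(t - shift) for t in log_terms)))
    return math.exp(log_num - log_den)

# Made-up data sets: one with a clear mean shift, one without.
x_shift = [0.1, -0.3, 0.2, 0.0, -0.1, 3.1, 2.9, 3.2, 2.8, 3.0]
x_flat = [0.1, -0.3, 0.2, 0.0, -0.1, 0.1, -0.1, 0.2, -0.2, 0.0]
odds_shift = posterior_odds_no_change(x_shift)
odds_flat = posterior_odds_no_change(x_flat)
print(odds_shift, odds_flat)
```

For the shifted data the odds in favour of no change are far below one, and they are much smaller than for the data without a shift.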
Outlier Problem


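The problem with outliers is related to the change point problem because the mathematical structure
for both of them is almost the same.

Consider a sequence of $n$ independent observations $x_1, x_2, \ldots, x_{k-1}, x_k, x_{k+1}, \ldots, x_n$. Let $m = k$ if, and
only if, the $k$th ($k = 1, 2, \ldots, n$) observation is the outlier. Suppose that the outlier $x_m$ has a pmf (pdf)
$f_2(x_m \mid \theta_2)$ and the other $(n-1)$ observations have a pmf (pdf) $f_1(x_i \mid \theta_1)$; $i = 1, \ldots, n$; $i \neq m$. The problems
are "How do we identify the outlier?" and "What are the estimates of $\theta_1$ and $\theta_2$?".

Let us assume that $m$ is independent of $\theta_1$ and $\theta_2$, that the prior probability mass function of $m$ is
$g(m) = 1/n$; $m = 1, \ldots, n$, and that the joint prior density of $\theta_1$ and $\theta_2$ is $g(\theta_1, \theta_2)$. The likelihood
function is

$$\ell(\theta_1, \theta_2, m \mid x) = \left[ \prod_{i=1,\, i \neq m}^{n} f_1(x_i \mid \theta_1) \right] f_2(x_m \mid \theta_2). \tag{11.14}$$

The joint posterior distribution of $\theta_1$, $\theta_2$, $m$ is

$$g(\theta_1, \theta_2, m \mid x) \propto \ell(\theta_1, \theta_2, m \mid x)\, g(\theta_1, \theta_2)\, g(m). \tag{11.15}$$

The marginal posterior distributions of $\theta_1$, $\theta_2$, and $m$ are

$$g(\theta_1 \mid x) = \sum_{m=1}^{n} \int_{\Theta_2} g(\theta_1, \theta_2, m \mid x)\, d\theta_2,$$

$$g(\theta_2 \mid x) = \sum_{m=1}^{n} \int_{\Theta_1} g(\theta_1, \theta_2, m \mid x)\, d\theta_1, \tag{11.16}$$

and

$$g(m \mid x) = \int_{\Theta_1} \int_{\Theta_2} g(\theta_1, \theta_2, m \mid x)\, d\theta_2\, d\theta_1.$$

These marginal posterior distributions may now be used to obtain Bayes estimates of $\theta_1$, $\theta_2$, and $m$.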
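Example 11.3. Consider a sequence of independent normal variates such that $X_i \sim N(\theta_1, 1)$, $i = 1, \ldots, n$;
$i \neq m$ and $X_m \sim N(\theta_2, 1)$. Let us assume that $m$ is independent of $\theta_1$ and $\theta_2$, that its prior mass
function is $g(m) = 1/n$, $m = 1, \ldots, n$, and that the joint prior density of $\theta_1$, $\theta_2$ is $g(\theta_1, \theta_2) \propto 1$,
$\theta_1, \theta_2 \in (-\infty, \infty)$. Then

$$g(\theta_1, \theta_2, m \mid x) = \frac{\exp\left[-\sum_{i \neq m} (x_i - \theta_1)^2/2\right] \exp\left[-(x_m - \theta_2)^2/2\right]}{\sum_{m=1}^{n} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp\left[-\sum_{i \neq m} (x_i - \theta_1)^2/2\right] \exp\left[-(x_m - \theta_2)^2/2\right] d\theta_1\, d\theta_2}$$

$$= \frac{\exp(-S_m/2)\, \exp\left[-(n-1)(\theta_1 - \bar{x}_m)^2/2\right] \exp\left[-(\theta_2 - x_m)^2/2\right]}{\dfrac{2\pi}{\sqrt{n-1}} \sum_{m=1}^{n} \exp(-S_m/2)},$$

where $S_m = \sum_{i \neq m} (x_i - \bar{x}_m)^2$ and $\bar{x}_m = \sum_{i \neq m} x_i / (n-1)$.

The marginal posterior distributions of $\theta_1$, $\theta_2$, and $m$ are given by

$$g(\theta_1 \mid x) = \sum_{m=1}^{n} \int_{-\infty}^{\infty} g(\theta_1, \theta_2, m \mid x)\, d\theta_2 = \frac{\sum_{m=1}^{n} \exp(-S_m/2) \sqrt{\dfrac{n-1}{2\pi}}\, \exp\left[-(n-1)(\theta_1 - \bar{x}_m)^2/2\right]}{\sum_{m=1}^{n} \exp(-S_m/2)}, \quad \theta_1 \in (-\infty, \infty),$$

$$g(\theta_2 \mid x) = \frac{\sum_{m=1}^{n} \exp(-S_m/2) \dfrac{1}{\sqrt{2\pi}} \exp\left[-(\theta_2 - x_m)^2/2\right]}{\sum_{m=1}^{n} \exp(-S_m/2)}, \quad \theta_2 \in (-\infty, \infty),$$

and

$$g(m \mid x) = \exp(-S_m/2) \Big/ \sum_{m=1}^{n} \exp(-S_m/2), \quad m = 1, 2, \ldots, n.$$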


As before, we see that the marginal posterior distributions of $\theta_1$ and $\theta_2$ are mixtures of normal
distributions with weight functions $w_m = \exp(-S_m/2)$, $m = 1, 2, \ldots, n$.

In order to estimate $m$ (to identify the outlier), we may obtain the mode of the marginal posterior
distribution of $m$. Bayes estimates of $\theta_1$ and $\theta_2$ may similarly be obtained as modes of the corresponding
marginal posterior distributions. However, if we are satisfied with the posterior mean as the Bayes estimate,
we have

$$E(\theta_1 \mid x) = \sum_{m=1}^{n} \bar{x}_m \exp(-S_m/2) \Big/ \sum_{m=1}^{n} \exp(-S_m/2),$$

$$E(\theta_2 \mid x) = \sum_{m=1}^{n} x_m \exp(-S_m/2) \Big/ \sum_{m=1}^{n} \exp(-S_m/2),$$

and

$$E(m \mid x) = \sum_{m=1}^{n} m \exp(-S_m/2) \Big/ \sum_{m=1}^{n} \exp(-S_m/2).$$
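The weights $w_m = \exp(-S_m/2)$ of Example 11.3 can be computed in a few lines. The following Python sketch is illustrative only; the function name and the data are hypothetical, and it assumes the unit-variance normal model with flat priors of the example.

```python
import math

def outlier_posterior(x):
    """Return [g(m=1|x), ..., g(m=n|x)], with g(m|x) proportional to exp(-S_m/2)."""
    n = len(x)
    total = sum(x)
    log_w = []
    for m in range(n):
        rest_mean = (total - x[m]) / (n - 1)      # mean of the sample without x_m
        s_m = sum((v - rest_mean) ** 2 for i, v in enumerate(x) if i != m)
        log_w.append(-0.5 * s_m)
    shift = max(log_w)                            # guard against underflow
    w = [math.exp(v - shift) for v in log_w]
    z = sum(w)
    return [v / z for v in w]

# Made-up data: the 4th observation is far from the rest.
x = [0.2, -0.1, 0.3, 5.0, 0.0, -0.2]
post = outlier_posterior(x)
m_mode = 1 + max(range(len(x)), key=post.__getitem__)     # posterior mode of m
# Posterior means as mixtures over m.
e_theta1 = sum(p * (sum(x) - v) / (len(x) - 1) for p, v in zip(post, x))
e_theta2 = sum(p * v for p, v in zip(post, x))
print(m_mode, round(e_theta1, 3), round(e_theta2, 3))
```

The posterior mode identifies the suspected outlier, while the two posterior means estimate the common mean of the regular observations and the mean of the outlying one.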


11.2. EMPIRICAL BAYES METHOD 


Empirical Bayes procedures utilize past data as a means for bypassing the necessity of
identifying a completely unknown and unspecified prior distribution having a frequency interpretation.
Some authors also include those cases in which the form of the prior distribution is stated up to the values
of the prior parameters, which are then estimated by means of past data.

In the empirical Bayes framework, we assume availability of the outcomes
$x_1, x_2, \ldots, x_n$ of $n$ independent experiments, the $i$th observation $x_i$ being generated by the pdf (or pmf)
$f(x_i \mid \theta_i)$, where these $\theta_i$ values are unknown. We may assume that these $\theta_1, \theta_2, \ldots, \theta_n$ themselves arise
as a random sample from the prior distribution $g(\theta)$. Our interest is in inference concerning the
parameter value $\theta$ (possibly generated by the prior distribution $g(\theta)$), when the current experiment
yields observation $x$. In other words, we assume that there exists a sequence $(X_1, \Theta_1), (X_2, \Theta_2), \ldots, (X_n, \Theta_n), \ldots$
of independent pairs of random variables $(X, \Theta)$ such that, at the time the current or $n$th decision
is to be made, the sequence of past observations $X_1, X_2, \ldots, X_{n-1}$, as well as $X_n$, is available to the
investigator. The sequence of parameter values $\theta_1, \theta_2, \ldots, \theta_n$ remains unknown.

In an empirical Bayes decision problem, we are interested in obtaining a decision function of $x$
based upon $x_1, x_2, \ldots, x_n$, represented by $\delta_n(x) = \delta(x; x_1, x_2, \ldots, x_n)$. When $x$ is observed, decision $\delta_n(x)$ is
taken and loss $\ell(\theta, \delta_n(x))$ incurred. The sequence $x_1, x_2, \ldots, x_n$ is used to estimate the prior distribution
$g$ in such a way that $\delta_n(x)$ approximates the unknown Bayes decision function $\delta_g(x)$ based on the actual
prior distribution for $\theta$.

The empirical Bayes methods may be classified in two different ways. One division is between
parametric empirical Bayes and non-parametric empirical Bayes. In the former, one assumes that the prior
distribution of $\theta$ is in some parametric class with unknown hyperparameters, while in the latter one
assumes only that the $\theta_i$'s are iid random variables. A different characterization of empirical Bayes
analysis can be given according to its operational aspect. One may use the data to estimate the prior
distribution, or the posterior distribution, or represent the Bayes rule in terms of the unknown priors
and then use the data to estimate the Bayes rule directly.

The non-parametric empirical Bayes approach supposes that a large amount of historical data is
available to estimate the prior and hence places no restrictions on the form of the prior. Robbins (1955)
sought a representation of the desired Bayes rule in terms of the marginal distribution $m(x)$ of $x$ and
then used the data to estimate it rather than the prior distribution. For example, consider independent
observations $x_1, x_2, \ldots, x_n$ such that $X_i \sim \text{Pois}(\theta_i)$, $i = 1, \ldots, n$, and that the $\theta_i$ are themselves iid from a
common prior pdf $g(\cdot)$ defined over $(0, \infty)$. The marginal distribution of $X_i$ is

$$m(x_i) = \int_0^{\infty} f(x_i \mid \theta)\, g(\theta)\, d\theta, \quad i = 1, 2, \ldots, n,$$

where $f(x_i \mid \theta)$ is the Poisson pmf.
Suppose that our interest is in estimating $\theta_i$ using the posterior mean. We have

$$E(\theta_i \mid x_i) = \int_0^{\infty} \theta_i\, g(\theta_i \mid x_i)\, d\theta_i = \frac{\int_0^{\infty} \theta_i\, f(x_i \mid \theta_i)\, g(\theta_i)\, d\theta_i}{m(x_i)} = \frac{\int_0^{\infty} \theta_i^{x_i + 1} \exp(-\theta_i)\, g(\theta_i)\, d\theta_i}{x_i!\; m(x_i)}$$

$$= \frac{(x_i + 1) \int_0^{\infty} f(x_i + 1 \mid \theta_i)\, g(\theta_i)\, d\theta_i}{m(x_i)} = (x_i + 1)\, \frac{m(x_i + 1)}{m(x_i)}.$$

Let us estimate $m(\cdot)$ by its empirical density

$$\hat{m}(j) = \frac{\text{the number of } x_i \text{ equal to } j}{n}.$$

The estimated Bayes estimate may be obtained by replacing $m(x_i)$ and $m(x_i + 1)$ by their respective
empirical densities $\hat{m}(x_i)$ and $\hat{m}(x_i + 1)$. The empirical Bayes estimate of $\theta_i$ will be

$$\hat{\theta}_i^{EB} = (x_i + 1)\, \frac{\text{number of } x_j \text{ equal to } (x_i + 1)}{\text{number of } x_j \text{ equal to } x_i}.$$

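Example 11.4. Suppose we have a sample of size n = 10 from a Poisson distribution and the observations
are 0, 1, 5, 2, 4, 6, 9, 6, 4, 2. The empirical density estimate of m, obtained by counting, is

    j        0     1     2     3     4     5     6     7     8     9
    m^(j)   0.1   0.1   0.2   0     0.2   0.1   0.2   0     0     0.1

with $\hat{m}(j) = 0$ for $j \ge 10$. Then the Bayes estimate based on these ten observations, when $x_i = 0$, is
$(0 + 1)\,\hat{m}(1)/\hat{m}(0) = 0.1/0.1 = 1$. However, if $x_i = 9$, then the Bayes estimate is
$(9 + 1)\,\hat{m}(10)/\hat{m}(9) = 10 \times 0/0.1 = 0$.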
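The estimate of Example 11.4 can be reproduced with a frequency count. This is a minimal Python sketch; the function name `robbins_estimate` is a label chosen here for illustration.

```python
from collections import Counter

def robbins_estimate(x_i, sample):
    """Empirical Bayes estimate (x_i + 1) * #{x_j = x_i + 1} / #{x_j = x_i}."""
    counts = Counter(sample)
    if counts[x_i] == 0:
        raise ValueError("x_i must occur in the sample")
    return (x_i + 1) * counts[x_i + 1] / counts[x_i]

sample = [0, 1, 5, 2, 4, 6, 9, 6, 4, 2]          # data of Example 11.4
estimates = {v: robbins_estimate(v, sample) for v in sorted(set(sample))}
print(estimates)
```

The estimates jump between 0 and 12 across neighbouring values of $x_i$, which shows how unstable the raw frequency ratio is for a sample of this size.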


Remark 11.8. We observe erratic behaviour of the empirical Bayes estimate when the sample size is 
small. It may, therefore, be necessary to improve the estimate using some smoothing technique. 
Remark 11.9. In fact, the above kind of result holds for the members of the exponential family of discrete
probability distributions of the form

$$f(x \mid \theta) = \theta^x \exp(C(\theta) + V(x)), \tag{11.17}$$

for which the Bayes estimate (under SELF) as the posterior mean is

$$E(\theta \mid x) = \int_{\Theta} \theta\, f(x \mid \theta)\, g(\theta)\, d\theta \Big/ m(x)$$

$$= \int_{\Theta} \theta^{x+1} \exp(C(\theta) + V(x+1)) \exp(V(x) - V(x+1))\, g(\theta)\, d\theta \Big/ m(x)$$

$$= \exp(V(x) - V(x+1))\, \frac{m(x+1)}{m(x)}. \tag{11.18}$$

However, such a result may not hold good if we employ it for the members of the exponential family
of continuous probability distributions.
If $f(x \mid \theta)$ is a member of the continuous exponential family of distributions given by

$$f(x \mid \theta) = \exp(\theta A(x) + B(\theta) + C(x)), \tag{11.19}$$

then the Bayes estimate of $\theta$ may be written as

$$E(\theta \mid x) = \frac{1}{A'(x)} \left( \frac{m'(x)}{m(x)} - C'(x) \right), \tag{11.20}$$

where

$$m'(x) = \int_{\Theta} \frac{\partial f(x \mid \theta)}{\partial x}\, g(\theta)\, d\theta.$$

In particular, if $f(x \mid \theta)$ is normal with mean $\theta$ and known variance $\sigma^2$, then the Bayes estimate of $\theta$ is

$$E(\theta \mid x) = x + \sigma^2\, \frac{m'(x)}{m(x)},$$

and the empirical Bayes estimate of $\theta$, based on past observations, will be given by

$$\hat{\theta} = \frac{1}{A'(x)} \left( \frac{\hat{m}'(x)}{\hat{m}(x)} - C'(x) \right).$$

Note that the estimates $\hat{m}'(x)$ and $\hat{m}(x)$ are more complicated than the simple estimates of $m(x)$
obtained in the discrete case.
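As a quick consistency check of the normal case, suppose (purely for illustration; the symbols $\mu_0$ and $\tau^2$ are introduced here and are not part of the text above) that the prior is $N(\mu_0, \tau^2)$. The marginal is then $N(\mu_0, \sigma^2 + \tau^2)$, and the formula reproduces the familiar conjugate posterior mean:

```latex
\begin{align*}
m(x) &\propto \exp\!\left[-\frac{(x-\mu_0)^2}{2(\sigma^2+\tau^2)}\right]
\quad\Longrightarrow\quad
\frac{m'(x)}{m(x)} = -\frac{x-\mu_0}{\sigma^2+\tau^2}, \\[4pt]
E(\theta \mid x) &= x + \sigma^2\,\frac{m'(x)}{m(x)}
 = x - \frac{\sigma^2\,(x-\mu_0)}{\sigma^2+\tau^2}
 = \frac{\tau^2\,x + \sigma^2\,\mu_0}{\sigma^2+\tau^2}.
\end{align*}
```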
The feasibility of estimating a prior distribution $g$ depends on the possibility of finding a pdf (or
pmf) $g$ satisfying the relationship

$$m(x) = \int_{\Theta} f(x \mid \theta)\, g(\theta)\, d\theta. \tag{11.21}$$

The pdf (or pmf) $m(x)$ is often called a mixture of pdfs or pmfs, while $f(x \mid \theta)$ and $g(\theta)$ are referred to as
the kernel and mixing distributions, respectively. When $f(x \mid \theta)$ is a pdf, (11.21) is a Fredholm integral
equation of the first kind. The solution of this equation requires the concept of identifiability. In the
empirical Bayes estimation problems, we have information about $m(x)$ from the past sample
observations. If the estimate $\hat{m}(x)$ is reasonable enough to identify $m(x)$, then one may try to solve (11.21) for $g(\theta)$
for a given $f(x \mid \theta)$.

If we are able to assume that $g$ belongs to a certain parametric family of distributions, then the
problem reduces to knowing the hyperparameters of $g(\theta)$.
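Example 11.5. Let $f(x \mid \theta) = (1 - \theta)\theta^x$, $x = 0, 1, \ldots$. If we take $g(\theta)$ to be the conjugate prior density, which
happens to be a beta density

$$g(\theta \mid \alpha, \beta) = \frac{\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}, \quad \alpha, \beta > 0,$$

then the mixture density is

$$m(x \mid \alpha, \beta) = \frac{B(\alpha + x, \beta + 1)}{B(\alpha, \beta)}.$$

Thus, the hyperparameters $\alpha$ and $\beta$ of the prior density $g(\theta \mid \alpha, \beta)$ are also the parameters of the mixture
density $m(x \mid \alpha, \beta)$. These hyperparameters may be estimated from the past data. We can use any one
of the classical methods of estimation to estimate $\alpha$ and $\beta$.

Example 11.6. Let $f(x \mid \theta) = e^{-\theta}\theta^x / x!$ and $g(\theta \mid \alpha, \beta) = \beta^{\alpha} e^{-\beta\theta} \theta^{\alpha - 1} / \Gamma(\alpha)$, $\alpha, \beta > 0$. Then

$$m(x \mid \alpha, \beta) = \left(\frac{\beta}{\beta + 1}\right)^{\alpha} \frac{\Gamma(\alpha + x)}{\Gamma(\alpha)\,\Gamma(x + 1)} \left(\frac{1}{\beta + 1}\right)^{x}, \quad x = 0, 1, \ldots$$

is the negative binomial distribution having mean equal to $\alpha/\beta$ and variance $\alpha/\beta + \alpha/\beta^2$. Suppose we wish
to estimate $\alpha$ and $\beta$ by the method of moments. If the sample mean $\bar{x}$ and sample variance $s^2$ of the past
data are available, then

$$\bar{x} = \hat{\alpha}/\hat{\beta} \quad \text{and} \quad s^2 = \bar{x} + \bar{x}^2/\hat{\alpha}.$$

Solving the above two equations, we have

$$\hat{\alpha} = \frac{\bar{x}^2}{s^2 - \bar{x}}, \quad \hat{\beta} = \frac{\bar{x}}{s^2 - \bar{x}}, \quad \text{provided } s^2 > \bar{x}.$$

Suppose we wish to use the method of maximum likelihood. Denote $u = \alpha/\beta$ and $v = \alpha$. Then
the likelihood function of $u$ and $v$ is

$$\ell(u, v) = \prod_{i=1}^{n} \frac{\Gamma(v + x_i)}{\Gamma(v)\, x_i!} \left(\frac{v}{u + v}\right)^{v} \left(\frac{u}{u + v}\right)^{x_i} = \left(\frac{v}{u + v}\right)^{nv} \left(\frac{u}{u + v}\right)^{\sum x_i} \prod_{i=1}^{n} \frac{\Gamma(v + x_i)}{\Gamma(v)\, x_i!}.$$

The maximum likelihood estimates may now be obtained by partially differentiating the log likelihood with
respect to $u$ and $v$, and equating the derivatives to zero. We have

$$\frac{\partial \log \ell(u, v)}{\partial u} = 0$$

yielding $\hat{u} = \bar{x}$.

Now, instead of partially differentiating with respect to $v$, we substitute $\hat{u} = \bar{x}$ in $\ell(u, v)$ to have

$$\ell(\hat{u}, v) = \left(\frac{v}{\bar{x} + v}\right)^{nv} \left(\frac{\bar{x}}{\bar{x} + v}\right)^{n\bar{x}} \prod_{i=1}^{n} \frac{\Gamma(v + x_i)}{\Gamma(v)\, x_i!}.$$

Then $\hat{v}$ is that positive value of $v$ which maximises $\ell(\hat{u}, v)$; $\hat{v} = +\infty$ is a permissible solution.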
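Both estimation routes of Example 11.6 can be sketched numerically. The Python code below is illustrative only: the data are made up, the sample variance uses the divisor $n - 1$ (an assumption, since the text does not fix it), and the profile likelihood $\ell(\hat{u}, v)$ is maximised by a crude grid search rather than by any particular optimiser.

```python
import math

def gamma_poisson_moments(sample):
    """Method-of-moments estimates (alpha, beta) for the gamma-Poisson mixture."""
    n = len(sample)
    xbar = sum(sample) / n
    s2 = sum((v - xbar) ** 2 for v in sample) / (n - 1)   # assumed divisor n - 1
    if s2 <= xbar:
        raise ValueError("moment estimates require s^2 > xbar")
    return xbar ** 2 / (s2 - xbar), xbar / (s2 - xbar)

def profile_loglik(v, sample):
    """log l(u_hat, v) with u_hat = xbar substituted for u."""
    n = len(sample)
    xbar = sum(sample) / n
    value = n * v * math.log(v / (xbar + v)) + n * xbar * math.log(xbar / (xbar + v))
    for x in sample:
        value += math.lgamma(v + x) - math.lgamma(v) - math.lgamma(x + 1)
    return value

sample = [0, 2, 1, 7, 3, 0, 5, 1, 9, 2]           # made-up past counts
alpha_mom, beta_mom = gamma_poisson_moments(sample)

# Crude grid search for v_hat over 0.05, 0.06, ..., 50.00.
grid = [k / 100 for k in range(5, 5001)]
v_hat = max(grid, key=lambda v: profile_loglik(v, sample))
u_hat = sum(sample) / len(sample)
print(alpha_mom, beta_mom, u_hat, v_hat)
```

If the maximiser sits at the upper end of the grid, that is a sign the likelihood keeps increasing in $v$, which corresponds to the boundary solution $\hat{v} = +\infty$ mentioned above.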


The following result provides a useful relationship between prior and mixture distribution 
moments. 
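Result 11.1. Suppose $\mu(\theta)$ and $\sigma^2(\theta)$ are the mean and variance of $f(x \mid \theta)$. If $\nu$ and $\tau^2$ are the mean and
variance of the mixture density $m(x)$, then

$$\nu = E(\mu(\theta)) \tag{11.22}$$

and

$$\tau^2 = E(\sigma^2(\theta)) + E(\mu(\theta) - \nu)^2, \tag{11.23}$$

where the expectations are taken with respect to the prior density $g(\theta)$.
Proof.

$$\nu = E(X) = \int_{\mathcal{X}} x\, m(x)\, dx = \int_{\mathcal{X}} x \left[ \int_{\Theta} f(x \mid \theta)\, g(\theta)\, d\theta \right] dx = \int_{\Theta} g(\theta) \left[ \int_{\mathcal{X}} x\, f(x \mid \theta)\, dx \right] d\theta = \int_{\Theta} g(\theta)\, \mu(\theta)\, d\theta = E(\mu(\theta)).$$

Similarly,

$$\tau^2 = \int_{\mathcal{X}} (x - \nu)^2\, m(x)\, dx = \int_{\Theta} g(\theta) \left[ \int_{\mathcal{X}} (x - \nu)^2 f(x \mid \theta)\, dx \right] d\theta = E\left( E\left((X - \nu)^2 \mid \theta\right) \right)$$

$$= E\left[ E\left\{ (X - \mu(\theta))^2 + 2(X - \mu(\theta))(\mu(\theta) - \nu) + (\mu(\theta) - \nu)^2 \mid \theta \right\} \right] = E(\sigma^2(\theta)) + E(\mu(\theta) - \nu)^2,$$

since $E\left[ (X - \mu(\theta))(\mu(\theta) - \nu) \mid \theta \right] = 0$.

Remark 11.10. If $\mu(\theta) = \theta$ then $\nu = E(\theta)$, that is, the prior mean.
Remark 11.11. If, in addition, $\sigma^2(\theta) = \sigma^2$ is independent of $\theta$, then

$$\tau^2 = \sigma^2 + E(\theta - E(\theta))^2 = \sigma^2 + \text{variance of the prior distribution}.$$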
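Result 11.1 can be checked numerically on the Poisson-gamma mixture of Example 11.6, where $\mu(\theta) = \sigma^2(\theta) = \theta$. The sketch below is illustrative only: the hyperparameter values and the truncation point of the sum are arbitrary choices made for this check.

```python
import math

def neg_binomial_pmf(x, alpha, beta):
    """Mixture pmf m(x | alpha, beta) of the Poisson-gamma model."""
    log_pmf = (math.lgamma(alpha + x) - math.lgamma(alpha) - math.lgamma(x + 1)
               + alpha * math.log(beta / (beta + 1)) - x * math.log(beta + 1))
    return math.exp(log_pmf)

alpha, beta = 2.0, 0.5                     # arbitrary hyperparameters for the check
support = range(0, 600)                    # truncation; the omitted tail is negligible
pmf = [neg_binomial_pmf(x, alpha, beta) for x in support]

# Left-hand sides: mean and variance of the mixture, summed directly.
nu = sum(x * p for x, p in zip(support, pmf))
tau2 = sum((x - nu) ** 2 * p for x, p in zip(support, pmf))

# Right-hand sides of (11.22) and (11.23) for the gamma prior.
prior_mean = alpha / beta                  # E(theta)
prior_var = alpha / beta ** 2              # Var(theta)
rhs_mean = prior_mean                      # E(mu(theta))
rhs_var = prior_mean + prior_var           # E(sigma^2(theta)) + E(mu(theta) - nu)^2
print(nu, rhs_mean, tau2, rhs_var)
```

Both sides agree to numerical precision: the mixture mean equals the prior mean, and the mixture variance is the expected Poisson variance plus the prior variance.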


Example 11.7. Suppose $X \sim N(\theta, 1)$ and the prior distribution for $\theta$ is $N(\mu, \sigma^2)$. Since the conditional
density of $X$ is normal and the prior for $\theta$ is also normal, we know that the marginal density of $X$ is also normal.
If the data give the mean of $X$ as 1 and the variance equal to 3, then using the above remarks, we
have

$$1 = \nu = E(\theta) = \mu,$$

and

$$3 = \tau^2 = 1 + \sigma^2.$$

Thus, the prior density should be $N(1, 2)$.

Remark 11.12. The empirical Bayes method is not Bayesian in the true sense because it approximates
the prior distribution by frequentist methods. However, these methods are asymptotically equivalent to
the pure Bayesian methods and may be considered as an acceptable approximation in problems for
which a genuine Bayesian approach is too complicated or too costly.

A criticism of the parametric empirical Bayes approach is that it depends on frequentist methods
like the method of moments and the method of maximum likelihood to estimate the hyperparameters.

It can be argued that the improvements which empirical Bayes estimators bring over classical
frequentist estimators are owed to their mimicking of the Bayesian approach, whereas their suboptimality
can be attributed to the refusal to adopt a fully Bayesian paradigm.

The empirical Bayes approach may be considered as a two-stage estimation procedure in which
the hyperparameter is estimated from the marginal distribution and then the parameter is estimated
using a "pseudo-prior" where the hyperparameters are replaced by the estimates of the first stage.
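Example 11.8. Suppose $X_1, X_2, \ldots, X_k$ are $k$ independent observations such that $X_i$ has a Bin($n$, $\theta_i$)
distribution. Suppose that the parameters $\theta_i$ $(1 \le i \le k)$ are distributed according to the Beta($\alpha$, $\beta$)
distribution. The Bayes estimate of $\theta_i$ under SELF is the posterior mean

$$E(\theta_i \mid x_i) = \frac{\alpha + x_i}{\alpha + \beta + n}, \quad i = 1, 2, \ldots, k.$$

We may use the method of moments to estimate $\alpha$ and $\beta$ by noting that the marginal distribution
of each $X_i$ is beta-binomial. If $\hat{\alpha}$ and $\hat{\beta}$ are the method of moments estimates of $\alpha$ and $\beta$, respectively,
then the empirical Bayes estimate of $\theta_i$ is

$$\frac{\hat{\alpha} + x_i}{\hat{\alpha} + \hat{\beta} + n}.$$

The method of moments estimates $\hat{\alpha}$ and $\hat{\beta}$ may be determined as follows. Recall that the marginal
distribution of $X_i$,

$$m(x_i \mid \alpha, \beta) = \int_0^1 \binom{n}{x_i} \theta_i^{x_i} (1 - \theta_i)^{n - x_i}\, \frac{\theta_i^{\alpha - 1}(1 - \theta_i)^{\beta - 1}}{B(\alpha, \beta)}\, d\theta_i = \binom{n}{x_i} \frac{B(\alpha + x_i, \beta + n - x_i)}{B(\alpha, \beta)},$$

is a beta-binomial distribution with $E(X_i) = n\alpha/(\alpha + \beta)$ and

$$V(X_i) = \frac{n\alpha\beta}{(\alpha + \beta)^2} \left\{ \frac{\alpha + \beta + n}{\alpha + \beta + 1} \right\}.$$

If the historical data provide $\bar{x}$ and $s^2$ as the mean and variance of the marginal distribution, the
method of moments requires solving for $\hat{\alpha}$ and $\hat{\beta}$ from

$$\bar{x} = \frac{n\hat{\alpha}}{\hat{\alpha} + \hat{\beta}} \quad \text{and} \quad s^2 = \frac{n\hat{\alpha}\hat{\beta}}{(\hat{\alpha} + \hat{\beta})^2} \left\{ \frac{\hat{\alpha} + \hat{\beta} + n}{\hat{\alpha} + \hat{\beta} + 1} \right\}.$$

Solving these two equations gives

$$\hat{\alpha} = \bar{x}\, \frac{\bar{x}(n - \bar{x}) - s^2}{n s^2 - \bar{x}(n - \bar{x})}, \quad \hat{\beta} = (n - \bar{x})\, \frac{\bar{x}(n - \bar{x}) - s^2}{n s^2 - \bar{x}(n - \bar{x})},$$

provided $\hat{\alpha}, \hat{\beta} > 0$.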
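The two-stage procedure of Example 11.8 can be sketched as follows. The moment estimates are obtained by solving the mean and variance equations of the beta-binomial marginal for $\alpha$ and $\beta$; the function names and the numerical inputs are hypothetical and serve only as an illustration.

```python
def beta_binomial_moments(xbar, s2, n):
    """Solve xbar = n a/(a+b) and s2 = n a b (a+b+n) / ((a+b)^2 (a+b+1)) for (a, b)."""
    denom = n * s2 - xbar * (n - xbar)
    if denom == 0:
        raise ValueError("moment equations have no finite solution")
    common = (xbar * (n - xbar) - s2) / denom      # equals (a + b) / n
    alpha, beta = xbar * common, (n - xbar) * common
    if alpha <= 0 or beta <= 0:
        raise ValueError("moment estimates must be positive")
    return alpha, beta

def eb_estimate(x_i, n, alpha, beta):
    """Empirical Bayes estimate of theta_i: posterior mean with estimated (alpha, beta)."""
    return (alpha + x_i) / (alpha + beta + n)

# Illustrative inputs: these are the exact marginal moments for alpha = 2, beta = 3, n = 10.
alpha_hat, beta_hat = beta_binomial_moments(xbar=4.0, s2=6.0, n=10)
theta_hat = eb_estimate(x_i=7, n=10, alpha=alpha_hat, beta=beta_hat)
print(alpha_hat, beta_hat, theta_hat)
```

With these inputs the estimate for an observation of 7 out of 10 is shrunk from the raw proportion 0.7 towards the estimated prior mean 0.4.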


11.3 HIERARCHICAL BAYES MODEL 


Suppose we have an observation $x$ having the density $f(x \mid \theta)$ that depends on an unknown scalar
parameter $\theta$ for which we have a prior density $g(\theta)$ having a hyperparameter $\theta_1$.
For example, if we have an observation $x$ coming from a normal population with mean $\mu$ and
known variance $\sigma^2$, we employ the conjugate prior distribution for $\mu$ as $N(\mu_0, \sigma_0^2)$. The parameters $\mu_0$ and
$\sigma_0^2$ are the hyperparameters. So far, we have been assuming that $\mu_0$ and $\sigma_0^2$ are known. However, sometimes
either $\mu_0$ or $\sigma_0^2$, or both, are unknown. In order to perform the Bayesian analysis, we may either estimate
the hyperparameters using the past data or specify a prior for the unknown hyperparameters. The prior
for the unknown hyperparameters is known as the hierarchical prior (or hyper-prior). We may use
non-informative priors to simplify the process of specifying the hierarchical prior distribution.

Let us consider a decomposition of the prior distribution $g(\theta)$ in conditional distributions $g_1(\theta \mid \theta_1)$,
$g_2(\theta_1 \mid \theta_2)$, ..., $g_k(\theta_{k-1} \mid \theta_k)$ and a marginal hyperprior distribution $g_{k+1}(\theta_k)$ such that

$$g(\theta) = \int \int \cdots \int g_1(\theta \mid \theta_1)\, g_2(\theta_1 \mid \theta_2) \cdots g_k(\theta_{k-1} \mid \theta_k)\, g_{k+1}(\theta_k)\, d\theta_1\, d\theta_2 \cdots d\theta_k.$$

The distribution of $\theta_i$ (conditional on $\theta_{i+1}$ for $i < k$, marginal for $i = k$) is called the hyperprior of $\theta_i$,
and $\theta_i$ is called the hyperparameter of level $i$; $1 \le i \le k$.

Example 11.9. (Robert, 2003) Consider $X_i \sim N(\mu_i, 10)$, $i = 1, 2, \ldots, 7$, which represent yearly independent
measures of the intelligence quotient (IQ) of a child, for seven consecutive years. Since IQ tests are
supposed to account for an age effect, it is reasonable to consider that the $\mu_i$'s have the same mean
$\theta$, the true value of the IQ. A corresponding first level prior distribution is $g_1(\mu_i \mid \theta)$ as $N(\theta, \sigma_1^2)$,
$i = 1, 2, \ldots, 7$, with $\sigma_1$ known. If the child belongs to a thoroughly studied population of children, we may
introduce a second level of prior $g_2(\theta)$ as $N(\theta_0, \sigma_2^2)$, where $\theta_0$ and $\sigma_2^2$ are known second level
hyperparameters; otherwise we may very well take $g_2(\theta)$ to be a non-informative hyperprior for $\theta$.

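Result 11.2. Suppose the data $x$ is drawn from the population with pdf $f(x \mid \theta)$ and assume that the first
level prior of $\theta$ is $g_1(\theta \mid \theta_1)$ and the marginal hyperprior distribution of $\theta_1$ is $g_2(\theta_1)$. Let us write

$$f_1(x \mid \theta_1) = \int_{\Theta} f(x \mid \theta)\, g_1(\theta \mid \theta_1)\, d\theta,$$

$$g(\theta_1 \mid x) = \frac{f_1(x \mid \theta_1)\, g_2(\theta_1)}{\int_{\Theta_1} f_1(x \mid \theta_1)\, g_2(\theta_1)\, d\theta_1},$$

and

$$g(\theta \mid \theta_1, x) = \frac{f(x \mid \theta)\, g_1(\theta \mid \theta_1)}{\int_{\Theta} f(x \mid \theta)\, g_1(\theta \mid \theta_1)\, d\theta} = \frac{f(x \mid \theta)\, g_1(\theta \mid \theta_1)}{f_1(x \mid \theta_1)}.$$

The posterior distribution of $\theta$ is, therefore,

$$g(\theta \mid x) = \int_{\Theta_1} g(\theta \mid \theta_1, x)\, g(\theta_1 \mid x)\, d\theta_1. \tag{11.24}$$

Remark 11.13. The decomposition of the prior distribution into conditional prior distributions and
a marginal prior distribution is seen to hold for the posterior distribution of $\theta$ as well, and this
decomposition also holds for posterior moments. In particular, for a function $h(\theta)$,

$$E(h(\theta) \mid x) = E\left( E(h(\theta) \mid \theta_1, x) \mid x \right), \tag{11.25}$$

where the outer expectation on the right hand side is taken with respect to $g(\theta_1 \mid x)$ and the inner
expectation is taken with respect to $g(\theta \mid \theta_1, x)$.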

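Remark 11.14. The hierarchical Bayesian framework $X \sim f(x \mid \theta)$, $\theta \sim g_1(\theta \mid \theta_1)$, ..., $\theta_k \sim g_{k+1}(\theta_k)$ may be
considered as the usual Bayesian set up with $X \sim f(x \mid \theta)$, $\theta \sim g(\theta)$, where

$$g(\theta) = \int g_1(\theta \mid \theta_1)\, g_2(\theta_1 \mid \theta_2) \cdots g_k(\theta_{k-1} \mid \theta_k)\, g_{k+1}(\theta_k)\, d\theta_1 \cdots d\theta_k. \tag{11.26}$$

Thus the hierarchical approach enjoys the general optimality properties of the Bayesian approach with some
additional advantages.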

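Example 11.10. Suppose the data $x$ is drawn from Pois($\lambda$). Let the prior distribution of $\lambda$ be
Gamma(2, $\beta$) and the hyperprior for $\beta$ be the non-informative $g_2(\beta) \propto 1/\beta$. In order to obtain the posterior
distribution of $\lambda$, we follow the steps of Result 11.2. Since

$$f_1(x \mid \beta) = \int_0^{\infty} f(x \mid \lambda)\, g_1(\lambda \mid \beta)\, d\lambda = \int_0^{\infty} \frac{e^{-\lambda}\lambda^x}{x!}\, \beta^2 \lambda e^{-\beta\lambda}\, d\lambda = \frac{(x + 1)\,\beta^2}{(\beta + 1)^{x + 2}}, \quad x = 0, 1, \ldots,$$

$$g(\beta \mid x) = \frac{f_1(x \mid \beta)\, g_2(\beta)}{\int_0^{\infty} f_1(x \mid \beta)\, g_2(\beta)\, d\beta} = \frac{\Gamma(x + 2)}{\Gamma(2)\,\Gamma(x)}\, \frac{\beta}{(\beta + 1)^{x + 2}} = \frac{x(x + 1)\,\beta}{(\beta + 1)^{x + 2}}, \quad \beta \in (0, \infty),$$

and

$$g(\lambda \mid \beta, x) = \frac{f(x \mid \lambda)\, g_1(\lambda \mid \beta)}{f_1(x \mid \beta)} = \frac{(\beta + 1)^{x + 2}\, e^{-(\beta + 1)\lambda}\, \lambda^{x + 1}}{(x + 1)!}, \quad \lambda \in (0, \infty),$$

we have

$$g(\lambda \mid x) = \int_0^{\infty} g(\lambda \mid \beta, x)\, g(\beta \mid x)\, d\beta = \frac{x(x + 1)}{(x + 1)!}\, \lambda^{x + 1} e^{-\lambda} \int_0^{\infty} \beta e^{-\beta\lambda}\, d\beta = \frac{e^{-\lambda}\lambda^{x - 1}}{\Gamma(x)}.$$

Thus the posterior distribution of $\lambda$ for the above hierarchical Bayes model is Gamma($x$, 1), for $x \ge 1$.
The posterior mean is

$$E(\lambda \mid x) = E\left( E(\lambda \mid \beta, x) \mid x \right) = E\left( \frac{x + 2}{\beta + 1} \,\Big|\, x \right),$$

since

$$E(\lambda \mid \beta, x) = \int_0^{\infty} \lambda\, g(\lambda \mid \beta, x)\, d\lambda = \int_0^{\infty} \frac{(\beta + 1)^{x + 2}\, e^{-(\beta + 1)\lambda}\, \lambda^{x + 2}}{(x + 1)!}\, d\lambda = \frac{x + 2}{\beta + 1},$$

and

$$E\left( \frac{x + 2}{\beta + 1} \,\Big|\, x \right) = \int_0^{\infty} \frac{x + 2}{\beta + 1}\, g(\beta \mid x)\, d\beta = \int_0^{\infty} \left( \frac{x + 2}{\beta + 1} \right) \frac{x(x + 1)\,\beta}{(\beta + 1)^{x + 2}}\, d\beta = x.$$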

Remark 11.15. An advantage of hierarchical Bayes approach is that it reduces the arbitrariness of the 
hyperparameter choice and, in some sense, averages the Bayesian answers based on conjugate prior 
distributions. 

The averaging on the unknown hyperparameters reduces sensitivity of Bayesian analysis to the 
choice of prior distribution. 
Remark 11.16. Hierarchical Bayes approach suggests a compromise between the Jeffreys’ non- 
informative distributions which are sometimes difficult to justify and the conjugate distributions. 
Remark 11.17. In general, knowledge about the hyperparameters is often quite vague and, therefore,
the hyperpriors are frequently chosen to be at least partially non-informative. Berger (1985, page 187)
suggests that constant informative priors on hyperparameters may resolve difficulties regarding improper
posterior distributions coming up if the hyperprior, for example, $g(\sigma^2) = 1/\sigma^2$ is used in the analysis
involving unknown mean and variance as the parameters.


11.4 ROBUSTNESS 


A statistical procedure which is insensitive to departures from the underlying assumptions is
called robust, a term introduced by Box (1953). The results of robustness studies are fundamentally 
an appeal to conscience. Bayesian robustness is basically a ‘what if’ game, that is, we say ‘what if 
this is assumed’ then we expect the observations to exhibit certain values as compared to ‘what if that 
were assumed’, and decide which assumptions are more useful for the present problem. 

The notion of robustness provides the investigator a qualitative property which once assigned 
to a statistical tool, induces confidence in its use even though the theoretical assumptions were not 
fully verified or did not hold exactly. The importance of a robustness study in a Bayesian analysis is 
that it can reveal the sensitivity of analysis to a particular feature of subjective specifications that may 
have been assumed too quickly and without appropriate reflection. The problem of robustness has 
always been an important element of the foundations of statistics.

Box and Tiao (1964, a, b; 1968) have made important contributions both to the study of Bayesian
robustness and to the derivation of robust procedures. They observed that statistical theory depends
on specific assumptions concerning the shape of the parent population. They distinguish between the
inferences made about the parameters and the criterion used to draw inferences about the parameters.
Inference robustness means that inferences made about the parameter(s) on the basis of the data do
not change substantially with the change in the model, whereas, the criterion robustness means that
the sampling distribution of the criterion used to draw inferences about the parameter(s) under the
original model is not substantially affected by changing the model.

Criterion robustness is often employed in classical inference. It is concerned with an inference
criterion (such as an estimator) appropriate to the specified model $f_0(x \mid \theta)$ and examines how sensitive the
distribution of that criterion is to change in the model. For example, if the specified model is that
$X_1, X_2, \ldots, X_n$ is a random sample from $N(\theta, \sigma^2)$, an appropriate classical criterion for inference about $\theta$
is $\sqrt{n}\,(\bar{x} - \theta)/s$. Criterion robustness would then examine how the sampling distribution of this criterion
varies from the t-distribution with $(n-1)$ df as the distribution of the sampled population varies over the
class of possible distributions. According to Box and Tiao (1973), this approach is inadequate to examine
robustness because, as a model varies, the inference criterion should also change. On the other hand,
inference robustness examines whether the optimal inference is sensitive to the changes in the model,
and is the natural way to examine robustness in the Bayesian framework.

Remark 11.18. It is also possible to study robustness of posterior inference when either the conditional
distribution of $x$ given $\theta$ or the prior distribution of $\theta$ is misspecified. Box and Tiao (1973) have
examined robustness of posterior inference when the sampling distribution of $X$ belongs to a family
of exponential power distributions of the form

$$f(x \mid \theta, \sigma, \beta) = k\,\sigma^{-1} \exp\left\{ -\frac{1}{2} \left| \frac{x - \theta}{\sigma} \right|^{2/(1+\beta)} \right\}, \tag{11.27}$$

where $k^{-1} = \Gamma\left(1 + (1+\beta)/2\right) 2^{1 + (1+\beta)/2}$, $\sigma > 0$, $\theta \in (-\infty, \infty)$, and $-1 < \beta \le 1$.

Here, the parameter $\beta$ may be regarded as a measure of kurtosis indicating the degree of non-normality
of the parent population.
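
As a quick numerical check of the exponential power family in (11.27), the following Python sketch evaluates the density and integrates it with a simple trapezoidal rule. The function names `exp_power_pdf` and `integrate`, the integration limits and the grid size are illustrative assumptions, not part of Box and Tiao's treatment. With $\beta = 0$ the density should reduce to the normal density, and with $\beta = 1$ to the double exponential.

```python
import math


def exp_power_pdf(x, theta, sigma, beta):
    """Exponential power density of (11.27) for -1 < beta <= 1."""
    half = (1.0 + beta) / 2.0
    # k^{-1} = Gamma(1 + (1 + beta)/2) * 2^{1 + (1 + beta)/2}
    k = 1.0 / (math.gamma(1.0 + half) * 2.0 ** (1.0 + half))
    z = abs((x - theta) / sigma)
    return (k / sigma) * math.exp(-0.5 * z ** (2.0 / (1.0 + beta)))


def integrate(f, lo, hi, n=100_000):
    """Composite trapezoidal rule; crude but dependency-free."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h


if __name__ == "__main__":
    for beta in (-0.5, 0.0, 0.5, 1.0):
        area = integrate(lambda x: exp_power_pdf(x, 0.0, 1.0, beta), -60.0, 60.0)
        print(f"beta = {beta:+.1f}: total probability is approximately {area:.6f}")
```

Values of $\beta$ below zero give flatter-topped, lighter-tailed densities than the normal, and values above zero give heavier tails, which is why $\beta$ serves as an index of non-normality.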

Bayesian robustness is the sensitivity of Bayesian answers to the user's inputs, namely the prior, the
sampling model, the loss function, and the data. Most of the work on Bayesian robustness has
concentrated on imprecision of the prior, as the choice of prior has typically been an issue raised by
frequentists. Functional forms of the sampling distribution and the loss function are also inputs in the
analysis. Detailed reviews of the literature from a Bayesian viewpoint can be found in Berger (1984,
1990, 1994), which also include extensive bibliographies. The sensitivity of Bayesian answers to the
choice of prior can be assessed through marginal distributions, through posterior expected loss, and
through Bayes risk.

There are three main approaches to the study of Bayesian robustness. The first is the informal
approach, in which a few priors are considered and the corresponding posterior distributions are
compared in terms of relevant posterior functionals.

The second approach is called global robustness. In this approach, one considers the class of
all priors compatible with the elicited prior information and computes the range of the posterior mean, or
some other posterior characteristic, as the prior varies over the class.

The third approach, called local robustness, examines the rate of change in inferences with
respect to changes in the prior, and uses differential techniques to evaluate the rate.

Example 11.11. (O'Hagan and Forster, 2004) Suppose that the distribution of $X$ is $N(\theta, r)$, where the
precision $r$ is known. Let us assume that the prior distribution of the unknown mean $\theta$ is $N(\mu, \tau)$, where $\tau$ is the prior precision. We know
that the normal distribution is a conjugate prior for the unknown mean $\theta$, and we quite often take it as a default
prior because Bayesian inference is analytically tractable with this choice of prior.

Let us examine the sensitivity of the posterior mean to the choice of the $N(\mu, \tau)$ prior distribution. The
posterior distribution $g(\theta \mid x)$ is $N\left(\frac{\tau\mu + rx}{\tau + r},\ \tau + r\right)$, with posterior mean $E(\theta \mid x) = (\tau\mu + rx)/(\tau + r)$.

In order to examine the effect of misspecification of the prior mean, let us change $\mu$ to $\mu + \delta$. With
this new prior, $N(\mu + \delta, \tau)$, the posterior mean is
$E(\theta \mid x) = (\tau(\mu + \delta) + rx)/(\tau + r)$. The posterior mean changes by the amount $\tau\delta/(\tau + r)$. We note that
$\tau\delta/(\tau + r) \to 0$ if the prior precision $\tau$ is sufficiently small relative to the observation precision $r$ (that is,
$\tau/r \to 0$). In that case we say that the posterior mean is robust to misspecification of the prior mean $\mu$.


In order to examine the effect of misspecification of the prior precision on the posterior mean,
let us change the prior precision $\tau$ by an amount $\delta$. Then the posterior mean changes to
$((\tau + \delta)\mu + rx)/(\tau + \delta + r)$. Thus $E(\theta \mid x)$ changes by an amount $r\delta(\mu - x)/[(\tau + r)(\tau + \delta + r)]$ as the prior precision
changes from $\tau$ to $\tau + \delta$. The posterior mean will be insensitive (or robust) to the change in the prior
precision if $|\mu - x| \to 0$ or the precision $r$ of the distribution of $X$ is sufficiently small. Thus, the posterior
mean will be robust to misspecification of the prior if the prior distribution is relatively vague. However,
if the likelihood function is relatively flat, that is, the precision $r$ is relatively small, the posterior mean
will be insensitive to misspecification of the likelihood function.
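
The two sensitivity formulas of Example 11.11 can be checked numerically. The sketch below is only an illustration: the function names and the numerical values of $x$, $\mu$, $\tau$, $r$ and $\delta$ are assumptions chosen for the demonstration, not values from the example.

```python
def posterior_mean(x, mu, tau, r):
    """Posterior mean of theta when X ~ N(theta, precision r) and theta ~ N(mu, precision tau)."""
    return (tau * mu + r * x) / (tau + r)


def change_from_prior_mean(delta, tau, r):
    """Shift in the posterior mean when the prior mean moves from mu to mu + delta."""
    return tau * delta / (tau + r)


def change_from_prior_precision(delta, x, mu, tau, r):
    """Shift in the posterior mean when the prior precision moves from tau to tau + delta."""
    return r * delta * (mu - x) / ((tau + r) * (tau + delta + r))


if __name__ == "__main__":
    x, mu, r = 3.0, 0.0, 4.0  # hypothetical observation, prior mean, data precision
    for tau in (4.0, 0.4, 0.04):  # progressively vaguer priors
        print(
            f"tau = {tau:5.2f}: "
            f"mean shift = {change_from_prior_mean(1.0, tau, r):.4f}, "
            f"precision shift = {change_from_prior_precision(0.5, x, mu, tau, r):.4f}"
        )
```

As $\tau/r$ shrinks, the shift caused by a unit error in the prior mean shrinks with it, which is the robustness statement made in the example.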

Example 11.12. (Berger, 1985) Suppose that the prior for the normal mean $\theta$ is thought to be from
the normal family of distributions. An investigator determines that the median of the prior is zero and
the quartiles (that is, the 1/4-fractile and 3/4-fractile) are $-1$ and $1$. For the normal distribution, the mean and
median are equal. Thus, the prior mean is $\mu = 0$. For the $N(0, \sigma^2)$ prior distribution, requiring $P(\theta < -1) = 0.25$ and
$P(\theta < 1) = 0.75$ gives $\sigma^2 = 2.19$. Thus, the prior may be chosen to be the $N(0, 2.19)$ density. However,
if we assume that the prior is Cauchy with zero median and quartiles $-1$ and $1$, the
appropriate prior distribution is $C(0, 1)$. Thus, either the $C(0, 1)$ or the $N(0, 2.19)$ density may be considered
to be a reasonable prior for $\theta$ on the basis of the given prior information. The question is whether the
Cauchy or the normal prior should be used for obtaining the posterior mean.

Table 11.1
Posterior Means

Prior         x = 1    x = 2    x = 10
C(0, 1)       0.52     1.27     9.80
N(0, 2.19)    0.69     1.37     6.87

We observe that for $x \le 2$, the posterior means under the two priors are quite close to each other. This
suggests that the posterior mean is insensitive, to some degree, to the choice of the prior. However,
for $x = 10$, the posterior means are quite different from each other, and we may infer that the posterior mean
is not robust to a reasonable variation in the prior and that the robustness also depends on the
observed value of $x$.
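
The comparison in Example 11.12 can be reproduced approximately by numerical integration. The sketch below assumes that a single observation $X \sim N(\theta, 1)$ is available; this sampling model is not stated explicitly above, but it is the one under which the normal-prior posterior mean equals $2.19x/3.19$. The function names, the integration range and the grid size are illustrative choices. The normal-prior values should agree with the closed form, and the Cauchy-prior values should come out close to, though not necessarily identical with, the tabulated entries.

```python
import math


def posterior_mean(x, prior_density, lo=-40.0, hi=60.0, n=20_000):
    """Posterior mean of theta for X ~ N(theta, 1) (assumed) via the trapezoidal rule.

    prior_density may be unnormalised, since constants cancel in the ratio.
    """
    h = (hi - lo) / n
    num = 0.0
    den = 0.0
    for i in range(n + 1):
        theta = lo + i * h
        weight = math.exp(-0.5 * (x - theta) ** 2) * prior_density(theta)
        if i == 0 or i == n:
            weight *= 0.5  # end-point weights of the trapezoidal rule
        num += theta * weight
        den += weight
    return num / den


def normal_prior(theta, variance=2.19):
    """Kernel of the N(0, 2.19) prior."""
    return math.exp(-0.5 * theta ** 2 / variance)


def cauchy_prior(theta):
    """Kernel of the C(0, 1) prior."""
    return 1.0 / (1.0 + theta ** 2)


if __name__ == "__main__":
    for x in (1.0, 2.0, 10.0):
        print(
            f"x = {x:4.1f}: Cauchy prior {posterior_mean(x, cauchy_prior):.2f}, "
            f"normal prior {posterior_mean(x, normal_prior):.2f}"
        )
```

The heavy tails of the Cauchy prior let the posterior mean follow a surprising observation such as $x = 10$, whereas the normal prior shrinks it by the fixed factor $2.19/3.19$.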

Most of the work on Bayesian robustness avoids the choice of the loss function. In a
decision-theoretic framework, specifying the loss function can be an even more severe problem than
specifying a prior distribution. It is important to note that a decision problem will, in general, be
sensitive to the specification of the loss function for small or moderate errors. For example, when we
consider a weighted squared error loss function $\omega(\theta)(\theta - a)^2$, the Bayes decision (estimate) is highly
sensitive to the choice of the weight function $\omega(\theta)$. There is not much literature relating to loss robustness.

Example 11.13. (Dey, Lou, Bose, 1998) Suppose that $X \sim \text{Pois}(\theta)$. The natural conjugate prior for $\theta$ is
Gamma$(\alpha, \beta)$. The posterior pdf of $\theta$ is Gamma$(\alpha + x, \beta + 1)$. Let us consider the class of linex loss
functions $L(\theta, \delta) = \exp(a(\delta - \theta)) - a(\delta - \theta) - 1$, where $a \ne 0$. We know that for small values of $|a|$, the
loss function is close to the squared error loss. The investigator is not sure about the value of $a$. He
can at most suggest an interval $(a_1, a_2)$ in which the shape parameter $a$ of the loss function lies. Our
interest is to know whether the posterior expected loss is sensitive to the choice of $a$ or not. Since
the moment generating function of the posterior distribution is

$$M_{\theta \mid x}(t) = \left(1 - \frac{t}{\beta + 1}\right)^{-(x + \alpha)},$$

the Bayes estimate $\delta_a$ of $\theta$, under linex loss, is

$$\delta_a = \frac{x + \alpha}{a} \log\left(1 + \frac{a}{\beta + 1}\right) \tag{11.28}$$


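and the corresponding posterior expected loss is

$$E\left[L(\theta, \delta_a) \mid x\right] = E\left[\exp(a(\delta_a - \theta)) - a(\delta_a - \theta) - 1 \mid x\right]$$

$$= \log M_{\theta \mid x}(-a) + a E(\theta \mid x)$$

$$= (x + \alpha)\left[\frac{a}{\beta + 1} - \log\left(1 + \frac{a}{\beta + 1}\right)\right], \tag{11.29}$$

which is the minimum posterior expected loss, attained at the Bayes estimate $\delta_a$ under the linex loss function.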


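Since the posterior expected loss for any estimate $\delta$, under linex loss, is

$$E\left[\exp(a(\delta - \theta)) - a(\delta - \theta) - 1 \mid x\right] = e^{a\delta} E\left(e^{-a\theta} \mid x\right) - a(\delta - E(\theta \mid x)) - 1$$

$$= e^{a\delta} M_{\theta \mid x}(-a) - a\left(\delta - \frac{x + \alpha}{\beta + 1}\right) - 1$$

$$= e^{a\delta}\left(1 + \frac{a}{\beta + 1}\right)^{-(\alpha + x)} - a\left(\delta - \frac{x + \alpha}{\beta + 1}\right) - 1, \tag{11.30}$$

the range of the posterior expected loss for $a \in (a_1, a_2)$ is

$$R(x) = e^{a_2\delta}\left(1 + \frac{a_2}{\beta + 1}\right)^{-(\alpha + x)} - e^{a_1\delta}\left(1 + \frac{a_1}{\beta + 1}\right)^{-(\alpha + x)} - (a_2 - a_1)\left(\delta - \frac{x + \alpha}{\beta + 1}\right),$$

if $0 < a_1 < a_2$. The range of the posterior expected loss evaluated at the Bayes estimate $\delta_a$ of $\theta$, when
$a_1 = 1$, $a_2 = 2$, $\alpha = 1 = \beta$, is $R(x) = 0.2123(1 + x)$. Since $R(x)$ is small for small values of $x$, we may say
that loss robustness is achieved for small $x$.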
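
The figure $R(x) = 0.2123(1 + x)$ can be verified with a few lines of Python. The sketch below is an illustration under the setting of Example 11.13 (Poisson sampling, Gamma$(\alpha, \beta)$ prior, linex loss); the function names are assumptions introduced here.

```python
import math


def bayes_estimate(a, x, alpha, beta):
    """Linex Bayes estimate (11.28) under the Gamma(alpha + x, beta + 1) posterior."""
    return (x + alpha) / a * math.log(1.0 + a / (beta + 1.0))


def expected_loss(delta, a, x, alpha, beta):
    """Posterior expected linex loss (11.30) of an arbitrary estimate delta."""
    mgf_at_minus_a = (1.0 + a / (beta + 1.0)) ** (-(alpha + x))
    posterior_mean = (x + alpha) / (beta + 1.0)
    return math.exp(a * delta) * mgf_at_minus_a - a * (delta - posterior_mean) - 1.0


def loss_at_bayes(a, x, alpha, beta):
    """Posterior expected loss (11.29) evaluated at the Bayes estimate for shape a."""
    return expected_loss(bayes_estimate(a, x, alpha, beta), a, x, alpha, beta)


if __name__ == "__main__":
    alpha = beta = 1.0
    for x in (0, 1, 5, 20):
        spread = loss_at_bayes(2.0, x, alpha, beta) - loss_at_bayes(1.0, x, alpha, beta)
        print(f"x = {x:2d}: R(x) = {spread:.4f}, 0.2123 * (1 + x) = {0.2123 * (1 + x):.4f}")
```

The range grows linearly in $x$, so the conclusion about loss robustness holds only for small observed counts.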

Remark 11.19. Since there is a duality between loss and prior distribution, research on sensitivity 
jointly with respect to the prior and the loss is also relevant. The literature on this aspect is not very 
abundant. 


Question Bank


Chapter 1


1.1 Prove the following combinatorial identities


(i) $\binom{n}{r} + \binom{n}{r-1} = \binom{n+1}{r}$.
(i) For n>0, [Peco] 
r r 


(v) $\binom{n}{r} = \frac{\Gamma(n+1)}{\Gamma(r+1)\,\Gamma(n-r+1)} = \frac{1}{r\,B(n-r+1,\, r)}$.


(vi) For real numbers $a$ and $b$, and positive integer $n$, $\sum_{k=0}^{n} \binom{a}{k}\binom{b}{n-k} = \binom{a+b}{n}$.


1.2 If $A_1, \ldots, A_n$ are events such that $P\left(\bigcap_{i=1}^{j} A_i\right) > 0$ for $j = 1, 2, \ldots, n-1$, prove that

$$P\left(\bigcap_{i=1}^{n} A_i\right) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2) \cdots P\left(A_n \,\Big|\, \bigcap_{i=1}^{n-1} A_i\right).$$

Also, if $P\left(\bigcap_{i=1}^{j} A_i\right) = 0$ for some value of $j$, prove that $P\left(\bigcap_{i=1}^{n} A_i\right) = 0$.

1.3 Suppose that approximately 1/125 of all births are fraternal twins and 1/300 of births are identical
twins. Elvis Presley had a twin brother (who died at birth). What is the probability that Elvis
was an identical twin? (You may use the probability of a boy or girl birth as 1/2).
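
One way to organise the calculation for the twin problem is as a direct application of Bayes' theorem, sketched below in Python with exact fractions. The function name and the modelling assumptions are illustrative: we condition on Elvis being a boy with a twin, take an identical twin to be necessarily a brother, and take a fraternal twin to be a brother with probability 1/2.

```python
from fractions import Fraction


def prob_identical_twin(p_fraternal=Fraction(1, 125), p_identical=Fraction(1, 300)):
    """P(identical | twin brother), conditioning on the person being male."""
    brother_given_identical = Fraction(1)      # identical twins share their sex
    brother_given_fraternal = Fraction(1, 2)   # fraternal twin is a boy with probability 1/2
    joint_identical = p_identical * brother_given_identical
    joint_fraternal = p_fraternal * brother_given_fraternal
    return joint_identical / (joint_identical + joint_fraternal)


if __name__ == "__main__":
    print(prob_identical_twin())
```

Using `Fraction` keeps the arithmetic exact, so the answer can be compared directly with a hand calculation.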

1.4 Let (X, Y) be a random variable of the discrete type with joint pmf


1/6 
1/6 
1/12 


Consider the random variables $U = |X|$ and $V = Y^2$. Show that the pmf of (U, V) is


and marginal pmfs of U and V are

$$P(U = u) = \begin{cases} 1/6, & u = 0 \\ 5/6, & u = 1, \end{cases}$$

and

$$P(V = v) = \begin{cases} 5/12, & v = 1 \\ 7/12, & v = 4. \end{cases}$$


1.5 Let $X_1, X_2, Y_1, Y_2$ be discrete rvs such that the joint pmfs of $(X_1, X_2)$ and $(Y_1, Y_2)$ are as follows:


(i) Show that X, and Y are identically distributed rvs. 
(ii) | Check whether X, and X, are independent. 
(iii) Find the conditional pmf of X,, given X,= 0. 

1.6 Consider the following joint pmf of (X, Y):
(i) Find the probabilities P(X + Y < 8), P(XY < 14), and P(Y = 5 | X = 3).
(ii) Find the marginal pmf of X.
(iii) Find the conditional pmf of X, given Y = 6.
1.7 Show that

$$F(x) = \begin{cases} 1 - \frac{1}{2} e^{-x} & \text{if } x \ge 0 \\ \frac{1}{2} e^{x} & \text{if } x < 0 \end{cases}$$

is a continuous distribution function. Find the corresponding pdf.


1.8 If $f(x) = e^{-x}/(1 + e^{-x})^2$, $x \in (-\infty, \infty)$, find the distribution function of X.
1.9 Suppose X is a rv with pmf


3 Ei 0 1 5 
02. 200222«02 08s 


find the pmf of the rv $Y = X^2$.


1.10 If 
3 x 3-x 
Lie if x =0,1,2,... 
f(x) =4| x Jl 6} | 6 


0 otherwise, 


find the pmf of Y=X-1. 
1.11 Suppose the pmf of a rv X is 


29x 


2 ges, 
x! 


find the pmf of Y = nX. 
1.12 Suppose the rv X has a pdf

$$f(x) = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2), \quad x \in (-\infty, \infty),$$

find the pdf of Y = |X|.
1.13 If X has an absolutely continuous distribution function F(x), show that $Y = -\log F(X)$ has the pdf

$$f(y) = \begin{cases} \exp(-y) & \text{if } y \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$
1.14 A median of X is any value m such that

$$P(X \le m) \ge \tfrac{1}{2} \quad \text{and} \quad P(X \ge m) \ge \tfrac{1}{2},$$

which is equivalent to

$$P(X < m) \le \tfrac{1}{2} \quad \text{and} \quad P(X > m) \le \tfrac{1}{2}.$$

Show that the set of medians is always a closed interval $m_1 \le m \le m_2$.

1.15 Let X be a continuous random variable with pdf f(x). If m is the unique median of the distribution
of X and b is a real constant, show that

$$E[|X - b|] = E[|X - m|] + 2\int_{m}^{b} (b - x) f(x)\,dx,$$

provided the expectations exist.

1.16 Let X be a random variable whose pdf is f(x) and suppose that $E|X|^{\alpha} < \infty$, where $\alpha > 1$. Prove that
$E(|X - a|^{\alpha})$ is minimized when a is the unique number such that

$$\int_{x<a} (a - x)^{\alpha - 1} f(x)\,dx = \int_{x>a} (x - a)^{\alpha - 1} f(x)\,dx.$$

1.17 Let

$$f(x) = \begin{cases} e^{-x} & \text{if } x \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$

Find the median and mode of f(x).


1.18 (i) If f(x) a : x =1,2,...., find the mode of f(x).


- ( x€[0,1] 
Gi) ‘If f(x)= 


0 otherwise, 


how many modes are there? 


1.19 If $f(x) = (1/2)^x$, $x = 1, 2, 3, \ldots$, then show that its mode is at x = 1; however, for
$f(x) = 12x^2(1 - x)$, $0 < x < 1$, the mode is at x = 2/3.


1.20 Let $(X_1, X_2)$ be a two-dimensional discrete rv with probability mass function $f(x_1, x_2) = \frac{2}{k(k+1)}$
if $x_1 = 1, 2, \ldots, x_2$ and $x_2 = 1, 2, \ldots, k$ for a given positive integer k. Find $f(x_1)$, $f(x_2)$, $f(x_1 \mid x_2)$, $f(x_2 \mid x_1)$,
$E(X_1 \mid X_2)$, and $E(X_1)$.


1.21 If $f(x) = \frac{6}{\pi^2 x^2}$, $x = 1, 2, \ldots$, show that $M_X(t) = \sum_{x=1}^{\infty} \frac{6 e^{tx}}{\pi^2 x^2}$. Does the infinite series converge for
all values of t? What can you say about the existence of $M_X(t)$?


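1.22 Let $\psi(t_1, t_2) = \log M(t_1, t_2)$. Show that

$$\frac{\partial}{\partial t_i}\psi(0, 0), \quad \frac{\partial^2}{\partial t_i^2}\psi(0, 0); \ i = 1, 2, \quad \text{and} \quad \frac{\partial^2}{\partial t_1\,\partial t_2}\psi(0, 0)$$

yield the means, the variances, and the covariance of the two random variables.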
1.23
1.24 
1.25 


1.26 


1.27 


1.28 


1.29 


1.34 


1.35 


Var(X) = 0 if, and only if, X is a degenerate random variable. 
Suppose $P(X \ge 0) = 1$. Show that $\sqrt{E(X)} \ge E(\sqrt{X})$.


Show that 


f(x)= jen" + ser? ), —00 < KX < 00, 


1 
2V20 
is a proper pdf with modes at x =+]. 
Prove or disprove: 


(i) For a rv X, Fx aan: 
XxX} E(X) 


(ii) | If X and Y are independent rvs then 


Y] E(Y) 
Let f(x) = exp(-x), x > 0. Find E( X | X > 0). 


Let $f(x) = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)$, $x \in (-\infty, \infty)$. Find
(i) $f(x \mid x \ge 0)$, and
(ii) $E(X \mid X \ge 0)$ and $Var(X \mid X \ge 0)$.
Let X be a continuous random variable with pdf f(x) which is positive provided $0 < x < b < \infty$ and is
equal to 0 elsewhere. Show that
$$E(X) = \int_0^b (1 - F(x))\,dx,$$
where F(x) is the distribution function.
Show that for any function g
$$E(Y g(X)) = E(g(X)\,E(Y \mid X)).$$
If $M_X(t)$ exists at some $t \ne 0$, show that
$$M_X(t) = E\left(E(e^{tX} \mid Y)\right).$$
Show that
$$E(g(Y) h(X)) = E\left(h(X)\,E(g(Y) \mid X)\right).$$
Let $g(\cdot, \cdot)$ be any function of two random variables. Show that
(i) $E(g(X, Y) \mid Y = y) = E[g(X, y) \mid Y = y]$,
(ii) $E(XY \mid Y = y) = y\,E(X \mid Y = y)$.
Let $(X_1, X_2)$ be a two-dimensional random variable with joint pdf $f(x_1, x_2)$. Show that
$$Var(X_1 \mid X_2 = x_2) = E\left[(X_1 - E(X_1 \mid X_2 = x_2))^2 \mid X_2 = x_2\right] = E(X_1^2 \mid X_2 = x_2) - \left(E(X_1 \mid X_2 = x_2)\right)^2.$$

Show that
$$Cov(X, Y) = E(Cov(X, Y \mid Z)) + Cov(E(X \mid Z), E(Y \mid Z)).$$
1.38


2.1 


2.2 


23 


24 


2.5, 
Let $f(x) = \exp(-\theta)\theta^x/x!$, $x = 0, 1, 2, \ldots$; $\theta > 0$. Suppose that the value x = 0 cannot be observed.
Find the mean of the truncated rv.
Let $X^c$ be a random variable such that

$$X^c = \begin{cases} X & \text{if } |X| \le C \\ 0 & \text{if } |X| > C, \end{cases}$$

where $X^c$ is X truncated at C, and all moments of $X^c$ exist and are finite. Find $E(X^c)$ and $Var(X^c)$.
If X is a random variable with $E(X) = \mu$ and $Var(X) = \sigma^2$ and Y = g(X), show that
$$E(Y) \approx g(\mu) + \frac{1}{2}\sigma^2 g''(\mu)$$
and
$$Var(Y) \approx \sigma^2 (g'(\mu))^2,$$
provided all the derivatives of g exist.


Chapter 2


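Consider the Poisson scheme, in which we have n trials of an event with the probability of
success $p_i$ at the ith trial, $i = 1, 2, \ldots, n$. If X is the number of successes, show that
$$E(X) = n\bar{p}$$
and
$$Var(X) = n\bar{p}\bar{q} - n\sigma_p^2,$$
where $\bar{q} = 1 - \bar{p}$,
$$\bar{p} = E(p_i) \quad \text{and} \quad \sigma_p^2 = Var(p_i),$$
the mean and variance being taken over the n values $p_1, \ldots, p_n$.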
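
The variance identity for the Poisson scheme is easy to check numerically before proving it. The sketch below is illustrative only; the function names and the success probabilities are assumptions, and the mean and variance of the $p_i$ are taken over the n trial probabilities (population variance, divisor n).

```python
def poisson_scheme_moments(p):
    """Mean and variance of the number of successes via the p-bar / sigma_p^2 formulas."""
    n = len(p)
    p_bar = sum(p) / n
    var_p = sum((p_i - p_bar) ** 2 for p_i in p) / n  # variance over the n trials
    mean = n * p_bar
    variance = n * p_bar * (1.0 - p_bar) - n * var_p
    return mean, variance


def direct_variance(p):
    """Variance of a sum of independent Bernoulli(p_i) variables."""
    return sum(p_i * (1.0 - p_i) for p_i in p)


if __name__ == "__main__":
    probs = [0.1, 0.4, 0.7, 0.25]  # hypothetical success probabilities
    print(poisson_scheme_moments(probs), direct_variance(probs))
```

The identity shows that, for a fixed average success probability, unequal $p_i$ always reduce the variance relative to the binomial case.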
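Show that
$$\sum_{x=k}^{n} \binom{n}{x} p^x (1-p)^{n-x} = \frac{n!}{(k-1)!\,(n-k)!} \int_0^p t^{k-1}(1-t)^{n-k}\,dt.$$
Show that
$$\sum_{x=0}^{k-1} \frac{e^{-\lambda}\lambda^x}{x!} = \frac{1}{\Gamma(k)} \int_{\lambda}^{\infty} e^{-t} t^{k-1}\,dt, \quad k = 1, 2, \ldots$$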


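Suppose the mean and variance of the Beta$(\alpha, \beta)$ distribution are $\mu$ and $\sigma^2$. Show that
(i) $\sigma^2 < \mu(1 - \mu)$,
(ii) $\alpha = \mu\left(\frac{\mu(1-\mu)}{\sigma^2} - 1\right)$ and $\beta = (1 - \mu)\left(\frac{\mu(1-\mu)}{\sigma^2} - 1\right)$.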
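
The moment relations for the Beta distribution are often used to turn an elicited mean and variance into prior parameters. A minimal Python sketch follows; the function name `beta_from_moments` and the example values are assumptions for illustration.

```python
def beta_from_moments(mean, variance):
    """Return (alpha, beta) of the Beta distribution with the given mean and variance."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0.0 < variance < mean * (1.0 - mean):
        raise ValueError("variance must be smaller than mean * (1 - mean)")
    common = mean * (1.0 - mean) / variance - 1.0
    return mean * common, (1.0 - mean) * common


if __name__ == "__main__":
    # Hypothetical elicitation: mean 0.6 and standard deviation 0.3.
    alpha, beta = beta_from_moments(0.6, 0.3 ** 2)
    print(alpha, beta)
```

The check on the variance mirrors the constraint $\sigma^2 < \mu(1 - \mu)$; without it the formulas would return non-positive parameters.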


Give examples of pairs of values (a, b) for which the Beta(a, b) density is (i) decreasing, (ii)
increasing, (iii) increasing for $\theta < \theta_0$ and decreasing for $\theta > \theta_0$, and (iv) decreasing for $\theta < \theta_0$
and increasing for $\theta > \theta_0$.
2.6
27 


28 


29 


2.10 


2.11 


2.12 


2.13 


2.14 
2.15 


2.16 


2.17 


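In analogy with question (2.5), determine the possible shapes of the Gamma$(\alpha, \beta)$ density, $\alpha, \beta > 0$.

If $X \sim$ Gamma$(\alpha, \beta)$, show that $Y = 2\beta X$ has a Gamma$(\alpha, 1/2)$ $(= \chi^2_{2\alpha})$ distribution. Give the practical
implication of the result.

Let $X \sim$ Beta$(\alpha, \beta)$ and $Y = \frac{\beta X}{\alpha(1 - X)}$. Show that Y has Fisher's
F-distribution with parameters $(2\alpha, 2\beta)$, that is,

$$f(y) = \frac{\alpha^{\alpha}\beta^{\beta}\, y^{\alpha - 1}}{B(\alpha, \beta)\,(\beta + \alpha y)^{\alpha + \beta}}, \quad y > 0.$$

Let X and Y be iid Bin(n, p) random variables. Find the pmf of
(i) U = X + Y,
(ii) V = X − Y.
(iii) Let U = X/(Y+1) and V = Y + 1. Show that

$$P[U = u, V = v] = \binom{n}{uv}\binom{n}{v-1} p^{uv + v - 1}(1 - p)^{2n + 1 - v - uv}; \quad v = 1, 2, \ldots, n+1; \ u = 0, 1, \ldots$$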
uv || v-l 
Find the mgf of the density 


nN 1/2 
- 2 

om) gow 2 4 > 0A > 0,x > 0. 

27x 


f(x) = 
Suppose $Y = X^{\alpha}$ possesses an exponential distribution with cdf $F(y) = 1 - \exp(-\beta y)$, $y > 0$, $\beta > 0$, so
that X has a Weibull distribution with parameters $\alpha$ and $\beta$, and cdf
$F(x) = 1 - \exp(-\beta x^{\alpha})$, $x > 0$. If the first and the third quartiles of X are 8 and 15, respectively, find
$\alpha$ and $\beta$.

Suppose the distribution of $Y = \log X$ is $N(\mu, \sigma^2)$. If the mean and standard deviation of X are
50 and 25, respectively, find $\mu$ and $\sigma^2$.

Let $f(x \mid \theta)$ be Bin$(n, \theta)$, n known, and let the marginal pdf of $\theta$ be Beta$(\alpha, \beta)$. Find E(X), Var(X), and
$M_X(t)$.

Suppose $f(x \mid \theta)$ is Bin$(1, \theta)$ and $f(\theta)$ is U(0, 1). What is $E(e^{tX})$?

Suppose $f(x \mid \theta)$ is $N(\theta, \sigma^2)$ and $f(\theta)$ is N(m, v). Find the mgf of the marginal density function of
X. What is the marginal pdf of X?

Suppose X, the failure time (in hours) of an electronic component, has density
$f(x \mid \theta) = \frac{1}{\theta} e^{-x/\theta}$, $x > 0$, $\theta > 0$. The unknown $\theta$ has an Inverted-Gamma(1, 0.01) distribution. Calculate
the (marginal) probability that the component fails before time 200 hours.
Suppose the random variable X has a Burr distribution with parameters A, a, b having pdf 


abA*x? 

(A Ae oye 
If a~ Gamma(c, A), then show that the marginal pdf of X is 
boatx 1 

At+x? (log(A+x°)+A—loga)™ ° 


f(x|a)= >0 


f(x)= 
Suppose $X \sim$ Bin$(n, \theta)$ and $\theta$ has a Beta$(\alpha, \beta)$ prior distribution. Show that, if the marginal pmf
of X is constant, then it must be the case that $\alpha = \beta = 1$.


X ~ N(,0"),x € (2,00). Suppose T =(a,o-). Find the pdf of the truncated normal 


distribution. 
Let $X_1, X_2, \ldots, X_n$ be iid $N(\theta, 1)$ random variables. Suppose that the marginal distribution of $\theta$ is
N(m, v). Show that the marginal distribution (the prior predictive distribution) for
$X = (X_1, X_2, \ldots, X_n)$ is $N(m\mathbf{1}, I + vJ)$, where $\mathbf{1}$ is the $n \times 1$ vector of ones, $I$ is the $n \times n$ identity
matrix, and $J = \mathbf{1}\mathbf{1}'$ is the $n \times n$ matrix of ones.


Let Y=(Y,, Y,, ..., Y,) have a Dirichlet distribution with parameters ©,, O,, ..., O,,,- Show that 


k+l 


Yio Beua{ a... a} Also, show that 


i=2 


7 r k+l 
ba ¥,~ Bev at, >) a, wher pel. 


Let X,, X,, ..., X,,, be independent random variables each having a Gamma(q, 1)distribution. 
Define 


and 


k+l 


Yuu = »y X,. 
i=l 


Show that the joint pdf of Y,, Y, ..., Y, is Dirichlet distribution with parameters ,, 0, ..., 


ei Oisg: 
Show that the distribution of c log Y, where Y ~ Gamma(q, {), belongs to the exponential family. 
In terms of distance measure d(f,g) which of the two densities g,(x)=(6m)'e* ® or g,(x) = 0.5e™ 


is closer to (2m), where d(f, g) = f f cyog{ 2 Jo is Kullback-Leibler divergence measure. 
a glx 


Show that Gamma(, ©) distribution belongs to the one-parameter exponential family. 


Chapter 3 


Why is the book ‘Ars Conjectandi’ by James Bernoulli of particular interest in the study of 
Bayesian statistical inference? 
Distinguish between direct and inverse probability. 




33 


34 
3.6


3.7 


3.8 


3.9 


3.10 


3.12 


What did Rev. Thomas Bayes mean by the phrase 'Absolutely know nothing' about the
parameter $\theta$?

What is meant by the “likelihood principle”?

An urn contains N balls of which b are black. A sample of n balls is drawn from the urn. Let
$A_k$ be the event that the sample contains exactly k black balls and $B_j$ be the event that the jth
ball is black. Show that $P(B_j \mid A_k) = k/n$ when the sample is drawn without replacement. What can
you say about this probability if the sampling is done with replacement?

In a certain course, students compete independently. There is a 5% chance of getting 75% marks 
or above, 20% chance of getting marks between 60% and 75%, 30% chance of getting marks 
between 45% and 60%, 15% chance of getting marks between 36% and 45%, and otherwise of 
failing the course. Suppose that 30 students appear in the examination. What is the probability 
that there will be equal number of candidates assigned to each of the 5 categories at the end 
of the course? 

Let $X \sim N(\theta, 1)$ and $g(\theta) = c$, an arbitrary constant. Show that the posterior distribution of $\theta$, given
x, is N(x, 1).

Suppose that $X \sim N(\theta, \sigma^2)$ and $\Theta = \{1, 2\}$, with

$$g(\theta) = \begin{cases} 0.5 & \text{if } \theta = 1 \\ 0.5 & \text{if } \theta = 2. \end{cases}$$

(i) If $\sigma = 2$, obtain the marginal probability density for X and sketch it.

(ii) Find $P(\theta = 1 \mid x = 1)$, again supposing $\sigma = 2$.

(iii) Describe how the posterior density of $\theta$ changes in shape as $\sigma$ increases from 0.1 to 10.

Suppose that in a large population of voters, the proportion P who belong to the RJD party is
unknown and suppose that the prior distribution of P is Beta(1, 10).

(i) If, in a random sample of 1000 voters, it is found that 123 belong to the RJD party, what
is the posterior distribution of P?

(ii) Suppose that instead of taking a random sample as in part (i), voters are selected one at
a time until exactly 123 have been found who belong to the RJD party. Suppose that a
total of 1000 voters had to be selected in order to accomplish this. What is the posterior
distribution of P?
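
The two sampling schemes in this question differ only in a constant factor of the likelihood, which is the likelihood principle at work. The Python sketch below illustrates this numerically; the function names are assumptions, and the default counts are those of the question.

```python
import math


def beta_update(a, b, successes, failures):
    """Conjugate update of a Beta(a, b) prior after the given counts."""
    return a + successes, b + failures


def log_binomial_likelihood(p, n=1000, k=123):
    """Log-likelihood when n is fixed in advance and k successes are observed."""
    return math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1.0 - p)


def log_negative_binomial_likelihood(p, n=1000, k=123):
    """Log-likelihood when sampling stops at the k-th success, which occurs on trial n."""
    return math.log(math.comb(n - 1, k - 1)) + k * math.log(p) + (n - k) * math.log(1.0 - p)


if __name__ == "__main__":
    print(beta_update(1, 10, 123, 1000 - 123))
    for p in (0.05, 0.12, 0.3):
        # The difference does not depend on p, so both schemes give the same posterior.
        print(p, log_binomial_likelihood(p) - log_negative_binomial_likelihood(p))
```

Because the two log-likelihoods differ by a constant, the same `beta_update` call serves both parts of the question.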


Let X ~ N(@, 07), 6? known, and g(®) =1,,..,(8). Obtain the posterior density of 6 and find 


its mean. 

Let X ~ N(®, 1) and g(8) = exp[-{0|/2]. Obtain the posterior distribution of 8 and show that as 
X— too, the posterior distribution tends to N(x + sign(x), 1). 

Suppose crossing two particular hybrid petunias leads to plants with one of four different types
of flowers:

(1) Red-serrated edge
(2) Red-smooth edge
(3) White-serrated edge
(4) White-smooth edge

that is, each plant from this cross is a multinomial trial with four possible results. Suppose the separate plants are
independent, $p_1 = 0.12$, $p_2 = 0.28$, $p_3 = 0.18$ and $p_4 = 0.42$ for each plant, and three plants are
produced from crossing the two hybrid petunias. If $Y_j$ denotes the number of plants produced
of type j G = 1, 2, 3, 4), find the probability P(Y,= 2, Y= 1, Y,= Y,= 0) and also the 
PY =YH=YHLY Ho. 
If $g(\theta \mid x)$ is a posterior distribution associated with $f(x \mid \theta)$ and a prior $g(\theta)$, show that

$$\frac{g(\theta \mid x)}{f(x \mid \theta)} = k(x)\, g(\theta).$$

(i) Deduce that, if $f(x \mid \theta)$ belongs to an exponential family, the posterior distribution also
belongs to an exponential family, whatever g is.

(ii) Show that if $g(\theta \mid x)$ belongs to an exponential family, the same holds for $f(x \mid \theta)$.

Given a proper distribution $g(\theta)$ and a sampling distribution $f(x \mid \theta)$, show that the only case such
that $g(\theta \mid x)$ and $g(\theta)$ are identical occurs when $f(x \mid \theta)$ does not depend on $\theta$.

Discuss the statement, 'All men may know the works of God, and through these works know
God, but only men of great faith know God directly.'

Discuss the statement, 'People tend to believe results that support their preconceptions and
disbelieve results that surprise them. Bayesian methods encourage this undisciplined mode of
thinking.'

What is Stigler's law of Eponymy? Illustrate it with some examples.


Chapter 4 


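Suppose that two families $G_1$ and $G_2$ of prior distributions are both closed under sampling from
$f(x \mid \theta)$. Show that $G_1 \cup G_2$ and $G_1 \cap G_2$ are also closed under sampling from $f(x \mid \theta)$.

(i) Show that if a family of prior densities with members $g(\theta \mid \alpha)$, $\alpha \in A$, is closed under
sampling for a likelihood $\ell(\theta \mid x)$, then the family of prior densities whose members are

$$g(\theta \mid \boldsymbol{\alpha}, \boldsymbol{\lambda}) = \sum_{i=1}^{n} \lambda_i\, g(\theta \mid \alpha_i), \quad \alpha_i \in A, \ 1 \le i \le n, \ \lambda_i > 0, \ \sum_{i=1}^{n} \lambda_i = 1,$$

for $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_n)$, $\boldsymbol{\lambda} = (\lambda_1, \lambda_2, \ldots, \lambda_n)$, is also closed under sampling for a likelihood
$\ell(\theta \mid x)$.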


(ii) Suppose the prior density of the probability $\theta$ of a coin coming up heads on any toss is given
by

$$g(\theta) = \frac{1}{2}\left[g(\theta \mid 1, 1) + g(\theta \mid \alpha, \alpha)\right],$$

where $g(\theta \mid \alpha, \beta)$ is a Beta$(\alpha, \beta)$ density and $\alpha$ is a large integer. Find the posterior density
$g(\theta \mid x)$ of $\theta$, given x heads in n independent tosses of a coin, and show that it can be
written in the form

$$g(\theta \mid x) = \lambda\, g(\theta \mid a_1) + (1 - \lambda)\, g(\theta \mid a_2); \quad \lambda > 0, \ a_i = (\alpha_i, \beta_i).$$

Show, in particular, that if $\alpha \ge 2$ and x = 0, then as the number n of tosses increases, $\lambda$ tends
to unity.

Suppose $X \sim N(\theta, 1)$ and $g = (0.9)g_1 + (0.1)g_2$, where $g_1$ is N(0, 2) and $g_2$ is N(0, 10).

(i) If x = 1 is observed, find $g(\theta \mid x)$ and the posterior mean and variance.

(ii) If x = 7 is observed, find $g(\theta \mid x)$ and the posterior mean and variance. Compare it with (i).
4.4


45 


46 


47 


48 


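If $X \sim N(\theta, \sigma^2)$, $\sigma^2$ known, and
$$g(\theta) = \alpha g_1(\theta) + (1 - \alpha) g_2(\theta); \quad 0 \le \alpha \le 1,$$
where
$$g_i(\theta) = N(\mu_i, \sigma_i^2), \quad i = 1, 2,$$
derive the posterior distribution of $\theta$, given x, and show that it is of the form
$$g(\theta \mid x) = \beta g_1(\theta \mid x) + (1 - \beta) g_2(\theta \mid x); \quad 0 \le \beta \le 1.$$

Let $X \sim N(\theta, 1)$ and $g(\theta) = \frac{1}{2}\left[g(\theta \mid -\mu, 1) + g(\theta \mid \mu, 1)\right]$, where $g(\theta \mid m, \sigma^2)$ is the $N(m, \sigma^2)$ density.
Find $g(\theta \mid x)$ and show that the posterior variance V of $\theta$ satisfies

$$V \le \frac{1}{2} + \frac{\mu^2}{4},$$

with equality if, and only if, x = 0.
If you had used a normal prior density with the same mean and variance as that of $g(\theta)$, how
would your posterior density of $\theta$ vary from the one above?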
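Let $\alpha_1$, $\beta_1$, $\alpha_2$, $\beta_2$, m and n be positive integers. Show that if you use a prior
$g(\theta) = (1 - c)\,\text{Beta}(\alpha_1, \beta_1) + c\,\text{Beta}(\alpha_2, \beta_2)$ and the coin is tossed m + n times, producing m heads
and n tails, then the posterior density for $\theta$ is given by
$$(1 - c_1)\,\text{Beta}(\alpha_1 + m, \beta_1 + n) + c_1\,\text{Beta}(\alpha_2 + m, \beta_2 + n),$$
where
$$c_1 = c/\{A(1 - c) + c\},$$
$$\log A = \log B(\alpha_1 + m, \beta_1 + n) - \log B(\alpha_1, \beta_1) - \log B(\alpha_2 + m, \beta_2 + n) + \log B(\alpha_2, \beta_2).$$
Hence show that the posterior mean of $\theta$ is
$$\theta_c = (V_1 A(1 - c) + V_2 c)/(A(1 - c) + c),$$
where $V_1$ and $V_2$ are the means of the two posterior Beta components.

If A is very small, then show that the derivative
$$\frac{d\theta_c}{dc} = \frac{A(V_2 - V_1)}{(A + c(1 - A))^2}$$
is huge at c = 0 and becomes very small very rapidly as c moves away from 0.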


Let $f(x \mid \theta) = \exp[-(x - \theta)]\, I_{[\theta, \infty)}(x)$, $x \in R$. For a random sample $X_1, X_2, \ldots, X_n$, show that $x_{(1)}$ is a
sufficient statistic for $\theta$. Hence construct the conjugate prior density of $\theta$.
Let $X = (X_1, X_2, \ldots, X_n)$ denote a random sample from $N(\theta, \sigma^2)$, $\sigma^2$ known. Assume $\theta$ has a
$N(\theta_0, \sigma_0^2)$ prior distribution. Obtain the posterior distribution of $\theta$, given X.

(i) Discuss the posterior distribution when $\sigma_0$ is small relative to $\sqrt{\sigma^2/n}$. When
would a prior on $\theta$ with such a value of $\sigma_0$ be used?

(ii) Discuss the posterior distribution when $\sigma_0$ is large relative to $\sqrt{\sigma^2/n}$. What does the use
of a prior with such a value of $\sigma_0$ imply about $\theta$?

(iii) Find the limiting posterior variance as $n \to \infty$. What does this imply about the posterior
distribution of $\theta$?

(iv) Show that the influence of the prior distribution on the posterior distribution disappears
as $n \to \infty$.

(v) If $\hat{\theta}$ is the posterior mean of $\theta$, show that
$$Var(\theta \mid X) < Var(\bar{X}) \quad \text{and} \quad Var(\theta \mid X) < Var(\theta).$$
Let X be the number of heads in n tosses of a coin, whose probability of head is $\theta$.
(i) If $\theta \sim$ Beta$(\alpha, \beta)$, show that the posterior mean of $\theta$ always lies between the prior mean $\alpha/(\alpha + \beta)$
and the observed relative frequency of heads X/n.
(ii) If $\theta \sim U(0, 1)$, show that the posterior variance of $\theta$ is always less than the prior variance.
(iii) Give an example of a Beta$(\alpha, \beta)$ prior distribution and data (x, n) in which the posterior
variance of $\theta$ is larger than the prior variance.
Suppose that $X = (X_1, \ldots, X_n)$ is a random sample from NBin$(m, \theta)$ and that $\theta$ has a Beta$(\alpha, \beta)$
distribution. Show that the posterior distribution of $\theta$, given x, is Beta$(\alpha + mn, \sum x_i + \beta)$.
X ~ N(O, 6”), 8 known. Show that the conjugate prior of 0” is Inverted - Gamma distribution. If 


v.05 tnv 


1 2 
show that g(o*| x,, ...,X,) is oie +n, , V=—X(x, - 8)". 
n 


v,tn 
Suppose that $X \sim$ Pois$(\theta)$ and $\theta \sim$ Gamma$(\alpha, \beta)$. Find the posterior mean of the sampling
probability
$$P(X = j) = \frac{e^{-\theta}\theta^j}{j!}, \quad j = 0, 1, 2, \ldots$$


Show that the conjugate prior for 8 when 

(i) X ~ Inverse Gaussian(0, 0/0), © known, 0 > 0 

(ii) X~N(O,8),0>0 

belongs to family of Generalised Inverse Gaussian distribution given by 


g(8) =c 0? exo [20+] preal;a,b>0, 


where 


ptl 


c= {z) K,.: (2Vab ) 


and K, (+) is modified Bessel function of order v. 
If X ~ N(O, 6’), 8>0. Construct the conjugate prior for 0.

Suppose X ~ Bin(N, 8), 6 known. Find a prior distribution on N such that posterior distribution 
of N is NBin(x, 8). 

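Suppose that $X = (X_1, X_2, \ldots, X_n)$ is a random sample from an exponential distribution

$$f(x \mid \theta) = \frac{1}{\theta} e^{-x/\theta}, \quad x > 0, \ \theta > 0.$$

If the prior distribution of $\theta$ is Inverted-Gamma$(\alpha, \beta)$,

$$g(\theta \mid \alpha, \beta) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\, \theta^{-(\alpha + 1)} \exp\left(-\frac{1}{\theta\beta}\right),$$

then show that the posterior distribution of $\theta$, given X, is

$$\text{Inverted-Gamma}\left(\alpha + n, \ \left[\frac{1}{\beta} + \sum_{i=1}^{n} x_i\right]^{-1}\right).$$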

Let $X_1, X_2, \ldots, X_n$ be a random sample of size n from an exponential distribution with parameter
$\theta$ and mean $\mu = 1/\theta$. Suppose that the prior distribution of $\theta$ is Gamma$(\alpha, \beta)$. When will the posterior
variance of $\theta$ be less than the prior variance?

Suppose your prior distribution for $\theta$, the proportion of Marxists who support the death penalty,
is a Beta$(\alpha, \beta)$ with mean 0.6 and standard deviation 0.3.

(i) Determine $\alpha$ and $\beta$ of the prior distribution. Sketch the prior density.

(ii) A random sample of 1000 Marxists is taken, and 65% support the death penalty. What are
the posterior mean and variance of $\theta$?

Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a normal distribution with mean 0 and unknown
precision $\theta$. If the prior distribution for $\theta$ is gamma such that the posterior distribution of $\theta$ has
coefficient of variation 0.1, then show that sample size n must be 200.

The length of life of a lamp manufactured by a certain process has an exponential distribution
with an unknown parameter $\theta$. Suppose that the prior distribution of $\theta$ is a gamma distribution
for which the coefficient of variation is 0.5. A random sample of lamps is tested, and the length
of life of each lamp is noted. If the coefficient of variation of the posterior distribution
of $\theta$ must be reduced to the value 0.1, show that 96 lamps should be tested.
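
The arithmetic behind the lamp question rests on the fact that a Gamma distribution with shape $\alpha$ has coefficient of variation $1/\sqrt{\alpha}$. The sketch below assumes that $\theta$ is the rate of the exponential lifetime, so that each tested lamp adds one to the shape of the Gamma posterior; the function names are illustrative.

```python
import math


def gamma_shape_for_cv(cv):
    """Shape parameter of a Gamma distribution with the given coefficient of variation."""
    return 1.0 / cv ** 2


def lamps_needed(prior_cv, target_cv):
    """Smallest number of exponential lifetimes that brings the posterior CV down to target_cv."""
    extra_shape = gamma_shape_for_cv(target_cv) - gamma_shape_for_cv(prior_cv)
    # The small tolerance guards against floating-point rounding just above an integer.
    return max(0, math.ceil(extra_shape - 1e-9))


if __name__ == "__main__":
    print(lamps_needed(prior_cv=0.5, target_cv=0.1))
```

Note that the required sample size does not depend on the observed lifetimes, only on the prior shape and the target coefficient of variation.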

The length of time for which a certain person must wait each morning for a bus taking him to
work has a $U(0, \theta)$ distribution, where the value of $\theta$ is unknown and the prior distribution of $\theta$
is Pareto with parameters a > 0, b = 10. On how many mornings must he observe his waiting
time before he will be able to specify an interval having a length of 0.01 unit such that the
probability that the unknown value of $\log\theta$ lies in this interval is at least 0.95?

(J. Q. Smith, 1988)

Elementary particles are emitted independently from a nuclear source. If $X_1$ denotes the time
before the first emission, in minutes, and $X_i$ denotes the time between the emission of the
(i−1)th and ith particle, i = 2, 3, ..., we know that the density $f_1(x \mid \theta)$ of each $X_i$ is given by

$$f_1(x \mid \theta) = \begin{cases} \theta e^{-\theta x} & \text{if } x > 0, \ \theta > 0 \\ 0 & \text{otherwise.} \end{cases}$$

When you first obtain the data you find that you have only been given the number $Y_j$ of
emissions in the time interval (j−1, j). You know, however, from the above that the $Y_j$, j = 1, 2, ..., are
independent with mass function

$$f_2(y \mid \theta) = \frac{\theta^{y} e^{-\theta}}{y!}, \quad y = 0, 1, 2, \ldots; \ \theta > 0.$$

You take m observations and your last observation arrives exactly on the nth minute. If the prior
distribution for $\theta$ was a Gamma$(\alpha, \beta)$ distribution,
find your posterior distribution
(i) using $Y_1, Y_2, \ldots, Y_n$,
(ii) using $X_1, X_2, \ldots, X_m$,
and show that the two analyses give identical inferences on $\theta$.
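If $X \sim$ Beta$(\alpha, \theta)$ with known $\alpha$,
(i) show that the distribution belongs to the exponential family, and
(ii) obtain the conjugate prior distribution of $\theta$.
Repeat Question 23 for $X \sim$ Gamma$(\theta, \beta)$.
Let $\mathcal{F}$ be the natural conjugate family of an exponential family

$$f(x \mid \theta) = h(x) \exp(\theta x - \psi(\theta)).$$

Show that the conjugate family for $f(x \mid \theta)$ is given by

$$g(\theta \mid \mu, \lambda) = K(\mu, \lambda)\, e^{\theta\mu - \lambda\psi(\theta)},$$

and the posterior distribution is $g(\theta \mid \mu + x, \lambda + 1)$. Show that the set of mixtures of N conjugate
distributions

$$\tilde{\mathcal{F}} = \left\{ \sum_{i=1}^{N} w_i\, g(\theta \mid \mu_i, \lambda_i); \ \sum_{i=1}^{N} w_i = 1, \ w_i > 0 \right\}$$

is also a conjugate family. If $g(\theta) = \sum_{i=1}^{N} w_i\, g(\theta \mid \mu_i, \lambda_i)$, then

$$g(\theta \mid x) = \sum_{i=1}^{N} w_i'(x)\, g(\theta \mid \mu_i + x, \lambda_i + 1)$$

with

$$w_i'(x) = \frac{w_i K(\mu_i, \lambda_i)/K(\mu_i + x, \lambda_i + 1)}{\sum_{j=1}^{N} w_j K(\mu_j, \lambda_j)/K(\mu_j + x, \lambda_j + 1)}.$$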

Determine whether the following distributions are possible posterior distributions:

(i) Student's $t(k, \mu(x), \tau^2(x))$, where $X \sim N(\theta, \sigma^2)$ and $\sigma^2$ is known.

(ii) A truncated normal distribution $N(\mu(x), \tau^2(x))$, when $X \sim$ Pois$(\theta)$.

(iii) Pareto$(\alpha(x), \mu(x))$, when $X \sim$ Bin$(n, 1/\theta)$, n known.

Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample from the $N(\theta, \sigma^2)$ distribution, where both $\theta$ and $\sigma^2$
are unknown. The prior density of $\theta$ and $\sigma^2$ is $g(\theta, \sigma^2) = g_1(\theta \mid \sigma^2)\, g_2(\sigma^2)$, where $g_1(\theta \mid \sigma^2)$ is
$N(\mu, \sigma^2/m)$ and $g_2(\sigma^2)$ is the Inverted-Gamma$(\alpha, \beta)$ density.
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4.28 


4.29 


(i) Show that the joint posterior distribution of $\theta$ and $\sigma^2$ given $x$ is

\[ g(\theta,\sigma^2\mid x) = g_1(\theta\mid\sigma^2, x)\, g_2(\sigma^2\mid x) \]


where 
o 
g,(8|6°,x) is N{ nes 
n+m 
and 
= ee 

g,(o° |x) is an Inverted Gammel +". 

where 


_mut+nx ,, (1, 1 ee mn(xX —)° . 
Mx) = n+m ” P [pte a 1 i 


(it) | Show that the marginal posterior density of 8, given X, is a 3-parameter t-density with 


he 
(2a + n —1) df, location parameter U(x), and precision G + nl 2 + aL } 


1 
(ili) 2(8, O°) = Gt hon (o") , find £,(8| O°, x), 22 (O° |x) and g,(0|x). 
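Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\theta_1, \theta_2)$. Assume that $\theta_1$ and $\theta_2$ are a-priori
independent with respective densities

\[ g_1(\theta_1) = (2\pi w)^{-1/2} \exp\left[-\frac{1}{2w}(\theta_1 - m)^2\right], \]

\[ g_2(\theta_2) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, \theta_2^{-(\alpha+1)} \exp(-\beta\theta_2^{-1}). \]

Show that the marginal posterior density $g(\theta_1\mid x)$ of $\theta_1$ satisfies
\[ g(\theta_1\mid x) \propto h_1(\theta_1)\, h_2(\theta_1) \]
where $h_1(\theta_1)$ is a normal density and $h_2(\theta_1)$ is a t-density.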
Suppose that $X_1, X_2, \ldots, X_n$ are independent Pareto$(\mu, \tau)$ distributed random variables.
(i) Show that $m = \min(x_1, x_2, \ldots, x_n)$ and $v = \sum_{i=1}^{n} \log(x_i/m)$ are sufficient statistics for $\mu, \tau$.
Further show that $m$ and $v$ are independently distributed as Pareto$(\mu, n\tau)$ and
Gamma$(n-1, \tau)$, respectively.
(ii) For prior distribution of Pareto-Gamma type, written Pareto-Gamma(b, c, g, h) with density


function 
ctu | h&t?'e 
g(U,T) = — 0<p<b,t20, 
b T(g) 
show that the posterior distribution is also Pareto-Gamma distribution. 
(Kahn, 1987) 


Let $X = (X_1, X_2, \ldots, X_k)$ be iid Bin$(n, \theta)$ random variables. If the joint prior distribution of $n$ and
$\theta$ is such that $g(n, \theta) = g_1(n)\, g_2(\theta)$ and if the marginal prior for $\theta$ is a Beta$(a, b)$ distribution then
show that the marginal posterior distribution of $n$ is

\[ g_1(n\mid x) \propto g_1(n)\, \frac{\Gamma(kn-t+b)}{\Gamma(kn+a+b)} \prod_{i=1}^{k} \frac{\Gamma(n+1)}{\Gamma(n-x_i+1)} \]

for $n \ge x_{(k)}$, $t = \sum_{i=1}^{k} x_i$, where $x_{(k)}$ is the $k$th-order statistic.



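Let $X_1, X_2, \ldots, X_n$ be $n$ independent observations from either $N(\theta_1, 1)$ or $N(\theta_2, 1)$. Inference is
required about $\theta = (\theta_1, \theta_2)$ and in particular about $\delta = \theta_1 - \theta_2$. Suppose that, a-priori, $\theta_1$ and $\theta_2$
are independently distributed as $N(0, \sigma_1^2)$ and $N(0, \sigma_2^2)$, respectively. If we take $m$ observations
from $N(\theta_1, 1)$ and $(n-m)$ observations from $N(\theta_2, 1)$ then show that the posterior distributions of $\theta_1$
and $\theta_2$ are independent with variances

\[ \mathrm{Var}(\theta_1\mid x) = \frac{\sigma_1^2}{1+m\sigma_1^2} \quad\text{and}\quad \mathrm{Var}(\theta_2\mid x) = \frac{\sigma_2^2}{1+(n-m)\sigma_2^2}. \]

Further show that
\[ \mathrm{Var}(\delta\mid x) = \mathrm{Var}(\theta_1\mid x) + \mathrm{Var}(\theta_2\mid x) \]
is minimised for
\[ 2m_0 = n + \left(\frac{1}{\sigma_2^2} - \frac{1}{\sigma_1^2}\right). \]

The optimum value of $m$ is the next integer above or below $m_0$.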
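A brute-force check of the allocation rule above can be sketched as follows. The prior variances and the sample size are hypothetical, and the function names are not taken from the text.

```python
def posterior_variance_of_difference(m, n, s1_sq, s2_sq):
    # Var(theta_1 | x) + Var(theta_2 | x) with m draws from the first population.
    return s1_sq / (1 + m * s1_sq) + s2_sq / (1 + (n - m) * s2_sq)

def best_allocation(n, s1_sq, s2_sq):
    # Exhaustive search over the admissible integer allocations 0..n.
    return min(range(n + 1),
               key=lambda m: posterior_variance_of_difference(m, n, s1_sq, s2_sq))

n, s1_sq, s2_sq = 20, 4.0, 0.25          # hypothetical values
m0 = (n + 1 / s2_sq - 1 / s1_sq) / 2     # continuous minimiser from the formula
m_best = best_allocation(n, s1_sq, s2_sq)
print(m0, m_best)
```

With these values the continuous minimiser is 11.875 and the exhaustive search picks one of its neighbouring integers, in line with the closing remark of the question.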
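Consider a uniform distribution on the interval $(\theta_1, \theta_2)$, where the values of $\theta_1$ and $\theta_2$ are
unknown, and suppose that the joint prior distribution of $\theta_1$ and $\theta_2$ is a bilateral bivariate Pareto
distribution with density

\[ g(\theta_1,\theta_2\mid a,b,r) = \begin{cases} \dfrac{r(r+1)(b-a)^{r}}{(\theta_2-\theta_1)^{r+2}} & (\theta_1<a<b<\theta_2) \\ 0 & \text{otherwise.} \end{cases} \]

Show that the joint posterior pdf of $\theta_1, \theta_2$ is
\[ g(\theta_1,\theta_2\mid x) \propto (\theta_2-\theta_1)^{-(r'+2)}\, I_{(-\infty,a')}(\theta_1)\, I_{(b',\infty)}(\theta_2) \]
where

$r' = r+n$, $b' = \max(b, x_{(n)})$, $a' = \min(a, x_{(1)})$, $x = (x_1, x_2, \ldots, x_n)$.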




4.33 


5.1 


52. 


53 


54 


5.5 


5.6 


5.7 


5.8 


59 


If $r = 2$, how large a random sample must be taken from the uniform distribution in order that
the coefficient of variation of the posterior distribution of $\theta_2 - \theta_1$ will be reduced to 0.01?


Consider a bivariate normal distribution with an unknown mean vector $\theta = (\theta_1, \theta_2)'$ and a
known covariance matrix $\Sigma$. Suppose that the prior distribution of $\theta$ is a BVN$(\mu, \Sigma_0)$. How


large a random sample must be taken in order that the variance of the posterior distribution of 


1 2 
the random variable 6,—0, will be reduced to the value 0.01 when & -|; i and 
4 3 
Y= ? 


Chapter 5 
Explain Laplace’s principle of insufficient reason. 
Show that, if 6¢ [0,1], = and if g(@) => the prior distribution g(®) is Haldane’s 


distribution. 
(Robert, 2001) 
Assuming that $g(\theta) = 1$ is an acceptable prior for a real parameter $\theta$, show that this generalised
prior leads to $g(\sigma) = 1/\sigma$ if $\sigma > 0$ and $g(p) = 1/(p(1-p))$ if $p \in [0, 1]$ by considering the natural
transformations $\theta = \log\sigma$ and $\theta = \log(p/(1-p))$, respectively.
(O’Hagan, 1994) 
Let $X$ be the number of successes in $n$ independent trials with probability $c\theta$ of success in each
trial, where $c \in [0, 1]$ is a known constant. Show that the Jeffreys' non-informative prior
distribution for $\theta$ is

\[ g(\theta) \propto \theta^{-1/2}(1-c\theta)^{-1/2}. \]
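The stated prior can be sanity-checked numerically. The sketch below computes the Fisher information of Bin$(n, c\theta)$ by direct summation and compares its square root with the claimed kernel; the values of $n$ and $c$ are hypothetical and the function names are illustrative.

```python
from math import comb, sqrt

def fisher_information(theta, n, c):
    # E[(d/dtheta log f(X | theta))^2] for X ~ Bin(n, c * theta), by summation.
    p = c * theta
    total = 0.0
    for x in range(n + 1):
        pmf = comb(n, x) * p**x * (1 - p)**(n - x)
        score = x / theta - c * (n - x) / (1 - p)
        total += pmf * score**2
    return total

def jeffreys_kernel(theta, c):
    # Claimed form of Jeffreys' prior, up to a constant.
    return theta**-0.5 * (1 - c * theta)**-0.5

n, c = 10, 0.6   # hypothetical values
ratios = [sqrt(fisher_information(t, n, c)) / jeffreys_kernel(t, c)
          for t in (0.2, 0.5, 0.9)]
print(ratios)    # the ratio should not depend on theta
```

If the kernel is right, every ratio equals the same constant $\sqrt{nc}$, since the information works out to $nc/\{\theta(1-c\theta)\}$.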
Let $X \sim$ Pois$(\theta)$. Find Jeffreys' prior density for $\theta$ and then find $\alpha$ and $\beta$ for which the Gamma$(\alpha, \beta)$
density is a close match to Jeffreys' density.
Let $Y_1, Y_2, \ldots, Y_n$ denote a random sample from the exponential distribution with density $\theta e^{-\theta x}$ for
$0 < x < \infty$ and with mean $\lambda = 1/\theta$. Find (i) Jeffreys' prior for $\theta$ and (ii) Jeffreys' prior for $\lambda$. Show that
both lead to the same posterior distribution.
If $X \sim N(0, \theta)$, $\theta > 0$, determine the Jeffreys' prior.
Consider $X_1, X_2, \ldots, X_{10}$ iid $N(0, \theta)$ with $\theta > 0$, which represents ten observations of the spiral
of a star. Justify the choice $g(\theta) = 1/\theta$.
Let $X \sim N(\theta, 1)$ and consider the one-to-one transform $\eta = \sinh\theta$.
(i) When $g(\eta) = 1$, show that the resulting posterior distribution on $\theta$ is

\[ g(\theta\mid x) \propto e^{x} N(x+1, 1) + e^{-x} N(x-1, 1). \]

(ii) Compare the behaviour of this posterior distribution with the usual posterior $N(x, 1)$ based
on Jeffreys' prior in terms of posterior variance, posterior quantiles and modes. In
particular, determine the values of $x$ for which the posterior distribution is bimodal and
those for which there are two global maxima.




5.10 


5.11 


5.12 


5.13 


5.14 


5.15 


5.16 


5.17 


5.18 




(iii) Consider the behaviour of $g(\theta\mid x)$ for large values of $x$ and conclude that the prior
$g(\eta) = 1$ is unreasonable.
Let f(x|0) = h(x) exp(Ox—y(0)); 0 = (6, ..., 0,). Show that Jeffreys’ prior for 0 is 


k 1/2 8 
a 8; 
Hence obtain g(@) when X ~ Bin (n, 9), n known. 
Suppose that $X$ has Bin$(n, \theta)$. If $\theta$ has the improper prior density $g(\theta) \propto 1/\{\theta(1-\theta)\}$, show that the
posterior density of $\theta$, given $x$, is proper provided $0 < x < n$.
Let $X_1, X_2, \ldots, X_n$ be independently distributed rvs such that $X_j \sim N(\theta_j, 1)$ with
$\theta_j \sim N(\mu, \sigma^2)$ $(1 \le j \le n)$ and the joint prior distribution of $(\mu, \sigma^2)$ is $g(\mu, \sigma^2) = 1/\sigma^2$. Show that
$g(\mu, \sigma^2\mid x_1, x_2, \ldots, x_n)$ is not defined.
Suppose $X \sim$ Bin$(n, \theta)$, $n$ and $\theta$ both unknown. If $n$ and $\theta$ are a-priori independent and $g_1(\theta) = 1$
for $\theta \in [0,1]$ and $g_2(n) \propto 1$ for all $n = 1, 2, \ldots$, then show that the marginal posterior distributions of $n$
and $\theta$ are $g_2(n\mid x) \propto 1/(n+1)$, $n = 1, 2, \ldots$ and $g_1(\theta\mid x) \propto 1/\theta$, respectively. What do you learn from it?
(Press, 1972) 
Suppose X~MVN(O,r). Show that the Jeffreys’ non-informative prior for r is 


g(r) 


(8); 1= wok, 


cea 


n 
Let $f(x\mid\theta) = \binom{n}{x}\theta^{x}(1-\theta)^{n-x}$; $x = 0, 1, \ldots, n$; $0 < \theta < 1$. Obtain Hartigan's ALI prior for $\theta$. How does
it differ from the Bayes-Laplace uniform prior for $\theta$? Compare it with Jeffreys' non-informative
prior.


Let f(x|9)= 5o7(- a} x,0>0. Show that Jeffreys’ prior and ALI prior for 0 is g(0) c 1/0. 


Also construct the conjugate prior for 0. 
Rayleigh pdf is 


f(x | 6) =— aes 56 >} s0>0. 


Show that ALI prior for 0 is g(0) << 1/0°. 
Let V, and V, be Var(0"|x) for the Rayleigh pdf under Jeffreys’ and ALI priors g,(8) and g,(0), 
respectively. Show that V, > V,. 
Let V, and V, are posterior variances of 0 under Jeffreys’ and ALI priors, respectively. 
(i) If X ~ Bin(n, 8), show that 
7 4n*(n+1)V, +2n+1 

4(n +1)? (n +2) 

(ii) — If X ~ Pois(®). Show that V,< V,. 
Suppose the rv X has a modified power series distribution with pmf 


1 


F(x |) = OER xe 
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5.22 


5.23 


5.24 


5.25 


5.26 


5.27 


where T is a subset of non-negative integers, a(x)> 0, u(®@) and v(@) are positive, finite and 
differentiable functions of 8. Then the ALI prior for the MPSD is 


(0) <2), 
u(@) 
(a) | Hence find ALI priors for 8 when 
(i) X ~ Bin(n, 8), n known. 
(ii) X ~ Pois(6), 
(iti) X follows generalised negative binomial distribution having pmf 


f(x | 0) « 6*(1-0)"***; x =0, 1, ..., 0<O<1, |OB|<1, B = 0 or B21, n> 0. 
(b) Show that Jeffreys’ prior for 0 is 


2 1/2 
| a J wu a wr 
vut+vu Vv uv 

(8) = Seer 
vu Vv uv 


(c) Show that the Poisson distribution is a member of the MPSD class of distributions. Hence
obtain (i) the Jeffreys' prior and (ii) Hartigan's ALI prior for the parameter $\theta$.

Let $X \sim$ Pois$(\theta)$. Find Jaynes' maximum entropy prior associated with $g_0(\theta)$, where $g_0(\theta)$ is
Jeffreys' non-informative prior for $\theta$, and $E(\theta) = 2$.


Show that the maximum entropy pdf is
\[ f(x) = \frac{1}{B(m,n)}\, \frac{x^{m-1}}{(1+x)^{m+n}}, \]
when $E(\log x)$ and $E(\log(1+x))$ are known.
Find the maximum entropy pdf when both the arithmetic and geometric means are known. 


Show that the maximum entropy distribution is Laplace's distribution
\[ f(x) = \frac{1}{2\sigma}\exp\left(-\frac{|x|}{\sigma}\right), \]
where $E[|X|]$ is prescribed.


Let $H = \int_{-M}^{M} p(u)\log p(u)\,du$. Show that $H$ is minimized, subject to $\int_{-M}^{M} p(u)\,du = 1$, when
\[ p(u) = \frac{1}{2M}, \quad -M < u < M. \]


Show that the maximum entropy pdf is multivariate normal with mean vector $m$ and variance
covariance matrix $\Sigma$, when $E(X_i) = m_i$, $\mathrm{var}(X_i) = \sigma_i^2$ and $\mathrm{Cov}(X_i, X_j) = \rho_{ij}\sigma_i\sigma_j$ for $i, j = 1, 2, \ldots, n$.
(Zellner, 1978) 
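If the random variable $X$ has a proper normalised pdf

(i) $f(x\mid\theta) = f(x-\theta)$ with $a < \theta < b$, $-\infty < x < \infty$, then the MDIP for $\theta$ is uniform.

(ii) $f(x\mid\theta) = \frac{1}{\theta} f\left(\frac{x}{\theta}\right)$, $0 < a < \theta < b$ and $-\infty < x < \infty$, and $\theta_1 = k\theta$, where $k > 0$, show that
the MDIP pdf's for $\theta$ and $\theta_1$ are given by $g(\theta) = c/\theta$, $a \le \theta \le b$ and $g(\theta_1) = c_1/\theta_1$, $ka \le \theta_1 \le kb$,
where $c$ and $c_1$ are normalizing constants.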
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If X is Pareto (0, 8) with pdf 


B 
£(0|0,B)=Boyr30 <O<x<-,B>0. 


B 


Discuss the criticism “Bayesianism assumes: (a) Either a weak or uniform prior distribution, in 
which case why better? (b) or a strong prior, in which case why collect new data? (c) or more 
realistically, something in between, in which case Bayesianism always seems to duck the issue.” 


1 
where @ is known, show that MDIP for f is, 2,(B) < B cx 5 : 


Chapter 6 


A company has to decide whether to accept or reject a lot of incoming parts. (Label actions $a_1$
and $a_2$, respectively). The lots are of three types: $\theta_1$ (very good), $\theta_2$ (acceptable), and $\theta_3$ (bad).
The loss $L(\theta_i, a_j)$, $i = 1, 2, 3$; $j = 1, 2$ incurred in making the decision is given in the following
table:


The prior belief is that $g(\theta_1) = g(\theta_2) = g(\theta_3) = 1/3$. Determine the Bayes action.
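The mechanics of finding a Bayes action from a loss table and a discrete prior can be sketched as below. The loss values are hypothetical stand-ins invented for the illustration (they are not the entries of the table in the question), and `bayes_action` is an illustrative helper name.

```python
from fractions import Fraction

def bayes_action(loss, prior):
    # Prior expected loss of each action; the Bayes action minimises it.
    actions = next(iter(loss.values())).keys()
    expected = {a: sum(prior[s] * loss[s][a] for s in prior) for a in actions}
    return min(expected, key=expected.get), expected

# Hypothetical loss table: rows are lot types, columns are actions.
hypothetical_loss = {
    "theta1": {"a1": 0, "a2": 3},
    "theta2": {"a1": 1, "a2": 2},
    "theta3": {"a1": 3, "a2": 0},
}
prior = {state: Fraction(1, 3) for state in hypothetical_loss}

best, expected = bayes_action(hypothetical_loss, prior)
print(best, expected)
```

For these made-up losses the expected losses are 4/3 for $a_1$ and 5/3 for $a_2$, so $a_1$ would be the Bayes action; the same function applies unchanged to any other table and prior.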

An insurance company is faced with taking one of the following 3 actions:
$a_1$ = increase sales force by 10%, $a_2$ = maintain present sales force, $a_3$ = decrease sales force by
10%. Depending on whether the economy is good $(\theta_1)$, mediocre $(\theta_2)$, or bad $(\theta_3)$, the company
would expect to lose the following amounts of money in each case:

(Loss table: action taken by state of economy.)

The company believes that $\theta$ has the probability distribution

$g(\theta_1) = 0.2$, $g(\theta_2) = 0.3$, $g(\theta_3) = 0.5$.
Order the actions according to their Bayes risks and state the Bayes action.
Consider the class of decisions $D = \{d_1, d_2\}$ and loss function $L(\theta, d_1) = 0.5 + \theta$, $L(\theta, d_2) = 2 - \theta$.
Give the optimal prior decisions when the prior distribution of $\theta$ is (i) Beta(1, 1) and
(ii) Beta(2, 2).
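Because both losses are linear in $\theta$, the prior expected losses depend on the prior only through its mean. A minimal sketch, with an illustrative function name:

```python
def prior_expected_losses(a, b):
    # Mean of Beta(a, b) plugged into the two linear loss functions.
    mean = a / (a + b)
    return {"d1": 0.5 + mean, "d2": 2 - mean}

for a, b in [(1, 1), (2, 2)]:
    losses = prior_expected_losses(a, b)
    print((a, b), losses, min(losses, key=losses.get))
```

Both priors have mean 1/2, so the two cases give the same expected losses and hence the same optimal prior decision.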
Each time a tornado alert is given, a certain community has a choice of three possible decisions:
$a_1$ (no mobilization), $a_2$ (partial mobilization), $a_3$ (full mobilization). Tornado severity is measured on
a scale $\theta$ with three possible values 0, 0.5, 1. The alert specifies a predicted severity $X$ that is a
rv for which $P_\theta[X = 0] = 1-\theta$, $P_\theta[X = 0.5] = \theta/2 = P_\theta[X = 1]$. It is known that in this area tornadoes
of severity 0 occur 50% of the time, and tornadoes of severity 0.5 and 1 occur 25% of the
time each. The loss incurred by the community by choosing decision $a_i$ when $\theta$ is true is as
given in the following table:

$L(\theta, a)$


Formulate the problem of deciding on an action as a statistical decision problem $(\Theta, D, L)$, $X$
and prior distribution on $\theta$. (Specify each of $\Theta$, $D$, $L$, $X$, $g(\theta)$ carefully and precisely). What is
the Bayes decision if we observe $X = 0$?
State the "Bayes Principle" for obtaining a best decision rule. Let $X \sim N(\theta, 1)$, $L(\theta, a) = (\theta-a)^2$ and
let the decision rule be $\delta_c(x) = cx$. If the prior distribution $g(\theta)$ of $\theta$ is $N(0, \tau^2)$, calculate the Bayes
risk of $\delta_c$ and hence the Bayes decision rule. Show that the Bayes risk of $g$ is $\tau^2/(1+\tau^2)$. Also show
that the decision rule $\delta_1$, that is for $c = 1$, is better than any $\delta_c$ for $c > 1$.
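For rules of the form $\delta_c(x) = cx$ the risk is $R(\theta, \delta_c) = c^2 + (c-1)^2\theta^2$, so the Bayes risk under the $N(0, \tau^2)$ prior is $c^2 + (c-1)^2\tau^2$. The sketch below checks the minimising $c$ on a grid; the value of $\tau^2$ is hypothetical and the function names are illustrative.

```python
def frequentist_risk(c, theta):
    # Variance c^2 plus squared bias ((c - 1) * theta)^2 of the rule c * x.
    return c**2 + (c - 1)**2 * theta**2

def bayes_risk(c, tau_sq):
    # Average of the frequentist risk over theta ~ N(0, tau_sq).
    return c**2 + (c - 1)**2 * tau_sq

tau_sq = 3.0                                   # hypothetical prior variance
grid = [i / 1000 for i in range(0, 2001)]      # c in [0, 2]
c_best = min(grid, key=lambda c: bayes_risk(c, tau_sq))
print(c_best, bayes_risk(c_best, tau_sq), tau_sq / (1 + tau_sq))
```

The grid minimiser sits at $c = \tau^2/(1+\tau^2)$ with Bayes risk $\tau^2/(1+\tau^2)$, and `frequentist_risk(1, theta)` stays below `frequentist_risk(c, theta)` for every $c > 1$.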
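(i) If $X$ is Bin$(n, \theta)$, then show that

\[ E\left|\frac{X}{n}-\theta\right| = 2\binom{n-1}{k-1}\theta^{k}(1-\theta)^{n-k+1} \quad\text{for } \frac{k-1}{n} \le \theta \le \frac{k}{n}. \]

(ii) Graph the risk function of part (i) for $n = 4$ and $n = 5$.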


: . {1 n-l n-l 
[us the identity Joao =of(2o Jo-o-{ x jphas xs | 
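The closed form for the mean absolute error of $X/n$, namely $2\binom{n-1}{k-1}\theta^{k}(1-\theta)^{n-k+1}$ on $(k-1)/n \le \theta \le k/n$, can be compared with a direct summation over the binomial pmf. This is only a numerical check, not the derivation requested; the function names are illustrative.

```python
from math import comb

def mean_abs_error_exact(n, theta):
    # Direct evaluation of E|X/n - theta| for X ~ Bin(n, theta).
    return sum(comb(n, x) * theta**x * (1 - theta)**(n - x) * abs(x / n - theta)
               for x in range(n + 1))

def mean_abs_error_closed_form(n, theta):
    # k is chosen so that (k - 1)/n <= theta <= k/n.
    k = min(int(n * theta) + 1, n)
    return 2 * comb(n - 1, k - 1) * theta**k * (1 - theta)**(n - k + 1)

for n in (4, 5):
    for theta in (0.1, 0.3, 0.55, 0.9):
        print(n, theta, mean_abs_error_exact(n, theta),
              mean_abs_error_closed_form(n, theta))
```

Evaluating either function on a fine grid of $\theta$ values gives the points needed for the graphs in part (ii).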


In a statistical decision problem with loss function $L(\theta, a)$, consider the alternative loss function
$L^*(\theta, a) = cL(\theta, a) + g(\theta)$, where $c$ is a positive constant and $g(\theta)$ is any bounded function of
$\theta$. Prove that $L(\theta, a)$ and $L^*(\theta, a)$ lead to the same Bayes estimate.

Assume X ~ U(0, 8) is observed and it is desired to estimate 8 under squared error loss. 
Consider two priors for 8 


g,(8) = BO exp(-O/B)I.o,..) (0), 


and 


7 0) 
g, (0) =6a (ied Tio...) (9) , 


The median of the prior is felt to be 6. 

(i) Determine $\alpha$ and $\beta$.

(ii) Calculate the Bayes estimates of $\theta$ with respect to $g_1$ and $g_2$.
(iii) Calculate $R(\theta, \delta)$ for each of the Bayes estimates.

(iv) Which of the two rules is more appealing for large $\theta$?
Consider $X \sim$ Bin$(n, \theta)$ with $n$ known. If the prior of $\theta$ is a Beta$(a, b)$ distribution, under SELF, for what
values of $a$ and $b$ is the Bayes risk constant?

If $X \sim N(\theta, 1)$, $\theta \sim N(0, n)$, and the loss is squared error then show that the Bayes risk is $n/(n+1)$.

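Let $X \sim N(\theta, 1)$, where $\theta$ is a measure of some positive quantity.

(i) If the prior is $g(\theta) = I_{(0,\infty)}(\theta)$, and the loss function is $L(\theta, a) = (\theta-a)^2$, show that the Bayes
estimate of $\theta$ is

\[ x + \frac{e^{-x^2/2}}{\sqrt{2\pi}\, P(z \ge -x)}, \quad\text{where } z \sim N(0, 1). \]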


(i) If g(®) is Gamma(1, 1), the Bayes estimate of 0 is (x-1/n) , provided x >1/n. 


Obtain Bayes estimates of $\mu$ and $\sigma^2$ in the lognormal distribution when the joint prior is
$g(\mu, \sigma) \propto 1/\sigma^{c}$, $c > 0$, and the loss function is SELF.
If X,, X,, ..., X, is a random sample from Lognormal (0, 6’) distribution with o* known and if the 


prior for 0 is g(®) = exp (-0), 0 <@<o, show that the Bayes estimate of 8 under SELF is 


o 1 = Tz 
exp| z +—/ 1-— |], Z=— logx.. 
P| | *|| qa Mex 


Assume X = (X,, X,, .... X,) is a random sample from Pareto(A, o) distribution. Suppose A is 


given and the prior distribution of & is g(@) « 126 <@Q<oo, Show that the Bayes estimate, 
104 


l/n 
under SELF, of & is 1/log(G/A), where G = 1 s | is the geometric mean. 


i=l 


Let X = (X,, X,, ..., X_) be a random sample from Rayleigh pdf 
1 2 n 


x? 
f(x | 0) = — 6>0. 
(x | 8) ae fs 


Construct the natural conjugate family of priors for 8. Hence obtain the Bayes estimate of 0 
under SELF. 
Consider X,, X, tid with distribution 


f(x|®) = exp[-| x-0]/2, XE (-°°, co), and g(0) = 1,0 € (—ee, °°). 
Determine the Bayes estimates associated with the squared and absolute error losses. Also 
obtain Bayes estimate when the random sample X,, X,, X, is observed. 


6\(/M-9 M : 
Suppose f(x|9)= ;x = max (0,0+n-—M).,...,min(6,n), 
x || n-x n 

and the prior distribution of 8 is beta-binomial 


2(0 | 1,8) = : eee = 0,1,2,...n 
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(M+o+B)x +o(M-—n) 
n+at+f ; 


Show that the Bayes estimate of 8, under SELF, is 
Suppose 


f(x | u,8) = Fern] 259 Low X < 09; 


and the joint prior for 1. and 0 is g(u,8) «< 6°',8 > 0. Obtain Bayes estimate of 1 under SELF. 

Let X ~ N(O, 1) and quantity of interest is h(6) = e®. Give the risk of the Bayes estimate of h(8) 

associated with g(8) = 1 and SELF. 

Suppose X ~ Bin(n, 8). 

(i) Find the Bayes estimate of 0(1—0), under SELF, when 0 has the prior Beta(a, b) 
distribution. 

(ii) | Find the posterior expected loss of the estimate of 6(1—0) when @ has the Jeffreys’ prior. 

In an estimation problem with $\theta > 0$ and loss function $L(\theta, a) = (a-\theta)^2/\theta$, show that the Bayes
estimate is $[E(\theta^{-1}\mid x)]^{-1}$. Given instead $L(\theta, a) = (a-\theta)^2/a$, prove that the Bayes estimate is
$[E(\theta^{2}\mid x)]^{1/2}$.
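Both estimates can be checked on a small discrete posterior. The support points and probabilities below are hypothetical, and the helper names are illustrative; the check simply confirms that moving away from each claimed estimate increases the posterior expected loss.

```python
def bayes_estimates(support, probs):
    # [E(1/theta)]^(-1) for loss (a - theta)^2 / theta,
    # [E(theta^2)]^(1/2) for loss (a - theta)^2 / a.
    inv_mean = sum(p / t for t, p in zip(support, probs))
    second_moment = sum(p * t**2 for t, p in zip(support, probs))
    return 1 / inv_mean, second_moment**0.5

def expected_loss(a, support, probs, loss):
    return sum(p * loss(t, a) for t, p in zip(support, probs))

loss_over_theta = lambda t, a: (a - t)**2 / t
loss_over_a = lambda t, a: (a - t)**2 / a

support, probs = [0.5, 1.0, 2.0], [0.2, 0.5, 0.3]   # hypothetical posterior
est1, est2 = bayes_estimates(support, probs)
print(est1, est2)
```

Here `est1` is 1/1.05 and `est2` is the square root of 1.75, and small perturbations of either value raise the corresponding expected loss.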

Let X ~ N(0, r), precision r known. Obtain Bayes estimate of r under each of the above two loss 

functions, when g(r) is Jeffreys’ prior. 


Suppose X =(X,,X,,...,X,) is a random sample from the exponential distribution 


FI = 520 =} x >0. 


(i) Find the conjugate prior distribution of $\theta$ and identify it.

(ii) If the loss function is $L(\theta, a) = (a/\theta - 1)^2$, find the Bayes estimate of $\theta$.

Let $X$ possess a gamma distribution whose mean and variance are both equal to a positive
integer $M$. If the prior distribution of $M$ is geometric with pmf
$g(m\mid\theta) = (1-\theta)\,\theta^{m-1}$, $m = 1, 2, \ldots$; where $0 < \theta < 1$,

(i) prove that the posterior distribution of $M-1$, given that $X = x$, is a Poisson with mean
$\mu = x\theta$.

(ii) Under the loss function $L(m, a) = (m-a)^2/m$, $m = 1, 2, \ldots$, find the Bayes estimate of $M$
and show that its posterior expected loss is given by

\[ E(M\mid x) - \left(E(M^{-1}\mid x)\right)^{-1} = 1 - \frac{\mu}{e^{\mu}-1}. \]
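The posterior expected loss $E(M\mid x) - (E(M^{-1}\mid x))^{-1} = 1 - \mu/(e^{\mu}-1)$ can be verified numerically once $M - 1$ is taken to be Poisson with mean $\mu = x\theta$. The observation and the value of $\theta$ below are hypothetical, and the series is truncated at a fixed number of terms.

```python
from math import exp, factorial

def posterior_pmf(m, mu):
    # M - 1 ~ Poisson(mu), so P(M = m) = exp(-mu) * mu^(m-1) / (m-1)!.
    return exp(-mu) * mu**(m - 1) / factorial(m - 1)

def estimate_and_loss(mu, terms=60):
    # Bayes estimate under (m - a)^2 / m is 1 / E(1/M); the loss is E(M) - estimate.
    ms = range(1, terms + 1)
    mean = sum(m * posterior_pmf(m, mu) for m in ms)
    inv_mean = sum(posterior_pmf(m, mu) / m for m in ms)
    estimate = 1 / inv_mean
    return estimate, mean - estimate

x, theta = 2.5, 0.4          # hypothetical observation and geometric parameter
mu = x * theta
estimate, loss = estimate_and_loss(mu)
print(estimate, loss, 1 - mu / (exp(mu) - 1))
```

The last two printed numbers agree to rounding error, which is what the displayed identity predicts.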


Suppose X,, X,, ..., X_ is a random sample from 
f x 


Consider the prior for 8 as Inverted-Gamma(a, (6). Show that the Bayes estimate of 6 under 


L(0,a) -(5-1| is [25+ 3x Jr +2ae2y" 


n x 


Xie 6 ex SO. 


n 2] _ 1 
2 rf Jeo" 


= 
2 
Let $X \sim N(\theta, 1)$, given $\theta$, and let $\theta \sim N(0, 1)$. If $L(\theta, a) = (a-\theta)^2 \exp(3\theta^2/4)$, prove that the Bayes
estimate is $d(x) = 2x$, but this has uniformly higher risk than $t(x) = x$. Obtain the expected risk.
What do you infer?

Let X ~ Bernoulli(®); x = 0, 1; 0<@<1, and 9 ~ Beta(a, 6). If the loss function 


L(0,6) = 6* '(1-6)""* (8-6), obtain the Bayes estimate of 0. 
Let X ~ Pois(8), 8 > 0 and 8 ~ Gamma(a, 6). Find the Bayes estimate 6 of 6 under the loss 


L(0,6) = 8% Ye-" (9 - 6)?; a>1,b>0. 
If X ~ N(O, 1), and 8 ~ N(O, 1). If the loss function is 


L(0,6) = o[ 00-5 Joy 


Find the Bayes estimate of 0. 


1 - 
If f(x |@) 500 F 0>0, x>0. Take g(@) as Inverted-Gamma distribution and the loss 
function as L(0,6) =e */°6"?(@—6)° . Find the Bayes estimate of 0. 
Under the general loss function 

L(®, a) = (8) (a™-8")°; m > 0, w(8) > 0 for all 9, 
show that the Bayes estimate of 8 is 


E(o(8) | x) 


(Barnett, 1982) 

A radio telescope receives signals of two distinct types, coded 0 and 1, independently with
probabilities $1-\theta$ and $\theta$, respectively. The radio telescope is operated until the signal 1 has
occurred $r$ times; this happens on the $n$th signal. We know nothing about $\theta$ initially and choose
to represent this by $U(0, 1)$. We decide that the loss function is $L(\theta, a) = (a-\theta)^2/(\theta^2(1-\theta))$.
Show that the estimate of 6 is equal to (r—1)/(n—1) and the Bayes risk is 1/(1—1). 

Suppose X ~ N(0, 6”) and the parameter of interest is e®, when o? is known. 

(i) Among the estimators of the form §,(x) = Piel , determine Bayes estimate of e° for the 


SELF. 
(ii) | Repeat the problem for L(8, a) = |e® —a| when g(0) = e°, 6>0. 


is a solution of the 


Show that the Bayes estimate under the loss function L(0,a) = E- 


equation E(0|x)=2[ 0'g(0|x)d6. 




6.34 Let $\Theta = (0, \infty)$, $\mathcal{A} = [0, \infty)$, $X \sim$ Pois$(\theta)$, and let $L(\theta, a) = (\theta-a)^2/\theta$.
Show that d(x) = x is generalised Bayes rule with respect to g(0) - 1/0; @>0. 
635 Let © =(0, ~%), e¥= [0, -) and let 


+x-l 
real O={" . Jrareresxa02., 


r is some positive integer, and let the loss function be 
L(0,a) = (@—a) /(0(0 +1). 
(i) Show that d(x) = x/r is generalised Bayes rule with respect to prior g(0)<1, 
Be (0, ©), provided r>1. What happens when r=1? 
(ii) | Find Bayes estimate of @ with respect to prior density 
1 
B(a,B) 


g(8| ot, B) = r+ Oy Tg (8). 


6.36 (Ferguson, 1967) 
Let X =(X,,X,,...,X,)(n 22) be a random sample from N(u, 67). Let © be the half plane 


(uU, 6), o>0, and WE (—-9,°0). If the loss function is 
vu+Mo-a)” 
L((u, 0), a) a iOS 


for some given numbers v and @ and the joint prior distribution of tt and o is such that 


2" sn 
g(0) = ra exp| =r [0 Tea), 


and 


g(u|o) = N(0,r°o”), where o, A and r are known. 


(i) | Show that the Bayes estimate of vutwo is d(x) = vfit@6, where 


,. mx ~ T(2a+n+1)/2)( ns nx ‘aa 
UL = —,, and o= + 2 
1+nr T(2a+n+2)/2)| 2 2A(+nr-) 
having the minimum Bayes risk 
(37) 

2.2 
iat | 2a+n |, (2a+n+2) | 

+nr rf ss "| aL ° 


1/2 
(ii) If r3o and a—0, then ae) overor[ "1) "| /(o) which is a 


generalised Bayes rule. 
6.37 Show that the Bayes risk under the linex loss function,
\[ L(\Delta) = b(e^{a\Delta} - a\Delta - 1),\quad a \ne 0,\ b > 0,\ \Delta = \hat\theta - \theta, \]
is given by
\[ r(g) = ab\,E[E(\theta - \hat\theta_L)] \]
where $\hat\theta_L$ is the Bayes estimate under linex loss. Here, the first expectation is taken with respect
to $g(\theta)$ and the second expectation is taken with respect to $f(x\mid\theta)$. Hence, show that
\[ r(g) = ab\,E(\hat\theta_S - \hat\theta_L) \]
where $\hat\theta_S$ is the Bayes estimate under SELF.
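The identity behind this result can be seen at the level of a single posterior: with $\hat\theta_L = -a^{-1}\log E(e^{-a\theta}\mid x)$, the posterior expected linex loss at $\hat\theta_L$ equals $ab\,(E(\theta\mid x) - \hat\theta_L)$. The sketch below checks this on a hypothetical discrete posterior; the helper names and numbers are assumptions made for the illustration.

```python
from math import exp, log

def linex_estimate(support, probs, a):
    # Minimiser of the posterior expected linex loss with Delta = estimate - theta.
    return -log(sum(p * exp(-a * t) for t, p in zip(support, probs))) / a

def posterior_expected_linex(d, support, probs, a, b):
    return sum(p * b * (exp(a * (d - t)) - a * (d - t) - 1)
               for t, p in zip(support, probs))

support, probs = [0.5, 1.0, 2.0], [0.2, 0.5, 0.3]   # hypothetical posterior
a, b = 1.5, 2.0                                     # hypothetical loss constants
theta_L = linex_estimate(support, probs, a)
theta_S = sum(p * t for t, p in zip(support, probs))  # posterior mean (SELF)
pel = posterior_expected_linex(theta_L, support, probs, a, b)
print(pel, a * b * (theta_S - theta_L))
```

Averaging this per-sample identity over the marginal distribution of $x$ gives the stated expression for the Bayes risk.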


If X,, X,, ..., X, is a random sample from Pois(®) and the prior density of @ is Gamma(a, 8). Show 
that the Bayes risk of g is 


b a+ dx, 
{3s [++ 9{ 1452] 


B+n B+n 
6.38 Show that the Bayes estimate of 8, under the Quasi-quadratic loss function 


L(6,8) = (or -e% y 
is same as that under linex loss function. 
6.39 (Pandey & Rai, 1992) 
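Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\theta, \sigma^2)$, $\sigma^2$ known.
(i) If the conjugate prior for $\theta$ is $N(\mu, \tau^2)$, show that the Bayes estimates of $\theta$
under linex loss function and SELF are

\[ \hat\theta_L = \frac{\bar X + A\mu}{1+A} - \frac{a\sigma^2}{2n(1+A)} \]
and
\[ \hat\theta_S = \frac{\bar X + A\mu}{1+A}, \text{ respectively, where } A = \frac{\sigma^2}{n\tau^2}. \]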


(ii) | Show that the Bayes estimate of h(®@) = 0? under linex is 


= 2 2 5 2 
f,(0) = X +AU, ia 2a0 : 1 ioe! 2a0 
1+A n(1+A) 2a n(1+A) 


provided a> =i) , and under SELF is 
2 e 1 o 
h, (0) = oy + : 
2(8) 144 ) n(1+A) 
6.40 (Srivastava, 1995)
Let X,, X,, .... X, be a random sample from the generalised gamma distribution with pdf 


ap-l a 
f(x |) =< rol ja on>o.0>0 


Tp) 9 
with known parameters o and p. Suppose the prior distribution of the parameter 0 is inverted- 
gamma having pdf 
b -c/8 
ce 
0) = , 0>0, 
gC ) T(b)e"*" 


and the loss function is modified linex 


L(A) = ble“ —aA-1], where a#0,b>0, andA=—-1, 


o|@m 


Show that the Bayes estimate is 


Pal +e} whe Zz =1-exp] 2b 
np 


a\ i= 


and risk function of @ is 


R(O,) 7 Hou{-( 2) —(np+b)z+a | 
-Z 


641 (Dey et. al., 1998) 
Suppose the rv X has a power series distribution with pmf 
f(x | 8) = u(0)0*a(x), x =0,1,2,...,0>0 
where 


co 


Wo={¥ a(x)0" ) 


x=0 


(i) Find the natural conjugate prior for 8 and hence obtain the posterior distribution of 0. 
Show that the Bayes estimate of 0 under the linex loss function is 


7 ‘ pi. 
-Loe§ K(n, +1,x+j+x,)(-D °] 
c 


7 K(n, +1,x+X,)j! 


and the posterior expected loss evaluated at the Bayes estimate a is 


ioe x K (ny +1,x + J+ XI) a pho th Xt 5+ Xo) | 
{20 K(n, +1,x+x,)J! K(n, +1,x+X,) 
(ii) | Find posterior expected loss of the linex estimate. 


(ii) Illustrate the results of (i) and (ii) when X ~ Pois(6). 
(Refer Example 6.30) 




6.42 


6.43 


6.44 


(Bhattacharya, Samaniego and Vestrup, 2002). 


Suppose X follows a k-variate N(0,o°I) distribution and let the prior distribution of © be 


MVNiu, 0,1) . The multivariate extension of linex loss function for the vector 0 and estimate 


L(6,6) = x Ee ~a,(6, -6)-1| 


Show that the Bayes estimate is given by 

on o o,o 

2 Xt 7 — 2 z 
0, +0 O,+0 ~ 2(0°+05)~’ 


where a=(a,,a,,...,a,),a, #0,i=1,2,...,k, and that posterior expected loss for ) is given by 


y [er {ondoXHauntond a, [E{A.X; +A.M; —A3a;)}- 6, ]-1], 


i=l 


where 


2 
ae ee ee = 50%, 


*(eagy ” 


Suppose X ~ Geo(6/(0+1)). Show that the entropy loss function for 0 is 


6 1 atl 
L(6,a) = toa( 2 }+(1+5 hoe{ E**). 


If the prior for 8 is g(0) = 0°, c>0, show that the Bayes estimate of 0 is 

(n-c)(2xtn-1)". 

(Joshi and Shah, 1991) 

Suppose X,, X,, ..., X, (n>3) be a random sample from inverse Gaussian distribution, with known 
coefficient of variation a, having pdf 


wy" (x-py 
rein=| ean - aoe }x>en>oar0 


2na* 


Consider the prior density g(() = cy? on|-fane (a,B)20, p real and c is a constant 


ptl 


2 
of proportionality given by c' = {5 k, (2/aB). 
Qa 


Show that the Bayes estimate of u, under the loss function 
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6.45 


6.46 


Lu.) =+5-2 
hp 


/ 
; S, 1/2 are (2(S,8,)"”) 
Us, | |, @6 8) 


<+ 
7 7P 


where 


2 = 2 
ae ry hee =) ee: nx [2 
n 2a° 


nx 


and k,(-) being the modified Bessel function of order u. 


(El Sayyad and Freeman, 1973) 

Let x be an observation from a Pois(@) distribution and let the prior distribution of 6 be given 
by a g(8) = 1/0, 8>0. Suppose that the decision maker wishes to consider the following five loss 
functions. Show that the corresponding Bayes estimates and their respective posterior expected 
losses (PEL) are as in the following table: 


L,(O, a) = (a— ey 
L,(0, a) = (a — 0)7/0 


L,(®, a) = (a— 8)°/0 1/(x-1) 


L,(8,a) = [oe 5 exp(ww(x)) w(x) =1/x 


L,(0,a) = (va V0 y [Pox +0.5)/P(x) x [P(x +0.5)/P()] 


where (x) = “log T(x) and w’(x) are, respectively, the digamma and trigamma functions. 
x 


Let X ~ N(O, 67), and g(o) << 1/o. Obtain Bayes estimate of o when the loss functions is 


( Lie.6)=[ 1 . 


(ii) L060)-{ 1] 
(eo) 


2 
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6.47 


6.48 


6.49 


6.50 


6.51 


6.52 


6.53 
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18 . 6,-0 
Gi) L,(0,6,) = 222. 


; . 6 6 
(iv) L,(0,6,)=—-log—-1. 
o o 


If $X \sim N(\theta, 1)$, $\theta \sim N(0, 1)$ and the loss function $L(\theta, a)$ is 1 if $a < \theta$, otherwise 0, show that there
is no Bayes estimator of $\theta$.

Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from $N(\theta, 1)$ and that the prior distribution of $\theta$
is $N(0, 1)$. If the value of $\theta$ must be estimated when the loss is squared error and if the sampling
cost is $c$ per observation, show that the optimal number $\hat n$ of observations is $\sqrt{1/c} - 1$.
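With a $N(0, 1)$ prior and unit-variance observations the posterior variance after $n$ observations is $1/(n+1)$, so the total expected cost is $1/(n+1) + cn$. A small search confirms the stated optimum; the cost $c$ is a hypothetical value and the function names are illustrative.

```python
def total_risk(n, c):
    # Posterior variance under squared error loss plus linear sampling cost.
    return 1 / (n + 1) + c * n

def best_sample_size(c, n_max=1000):
    # Exhaustive search over integer sample sizes.
    return min(range(n_max + 1), key=lambda n: total_risk(n, c))

c = 0.01                          # hypothetical cost per observation
print(best_sample_size(c), (1 / c) ** 0.5 - 1)
```

For $c = 0.01$ both the search and the formula give 9 observations.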
Suppose X ~ N(0,r),0~ N(u,t),L(0,a) =(@—a)’. If the sampling cost is c per observation, 


show that the optimal number fi of observations is 


(DeGroot, 1971) 
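Let $X \sim$ Bin$(n, \theta)$, $\theta \sim$ Beta$(\alpha, \beta)$ and consider SELF. Show that the optimal number $\hat n$ of observations
is
\[ \hat n = \left[\frac{\alpha\beta}{c(\alpha+\beta)(\alpha+\beta+1)}\right]^{1/2} - (\alpha+\beta) \]
where $c$ is the sampling cost per observation.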

Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from an exponential distribution for which the
parameter $\theta$ is unknown. Suppose that the value of $1/\theta$ has to be estimated when the loss
function is

\[ L(\theta, a) = \left(a - \frac{1}{\theta}\right)^2. \]

Assuming that the prior distribution of $\theta$ is Gamma$(\alpha, \beta)$ with $\alpha > 2$, show, if the cost per
observation is $c$, that the optimal number of observations $\hat n$ is given by
\[ \hat n = \frac{\beta}{\{c(\alpha-1)(\alpha-2)\}^{1/2}} - (\alpha-1). \]

Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from a Poisson distribution for which the mean
$\theta$ is unknown and suppose that the prior distribution of $\theta$ is Gamma$(\alpha, \beta)$. Show that, when
$\alpha + \sum x_i > 1$, a generalised maximum likelihood estimate of $\theta$ is specified by $(\sum x_i + \alpha - 1)/(n + \beta)$.


Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from a Bernoulli distribution for which the
parameter $\theta$ is unknown, and that the prior distribution of $\theta$ is Beta$(\alpha, \beta)$. Show that when
$1-\alpha < \sum_{i=1}^{n} x_i < n+\beta-1$, a generalised maximum likelihood estimate of $\theta$ is
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6.54 


6.55 


6.56 


6.57 


6.58 


6.59 


6.60 


\[ \hat\theta = \left(\sum_{i=1}^{n} x_i + \alpha - 1\right) \Big/ (n + \alpha + \beta - 2). \]


(Draper and Guttman, 1971) 
Let $X_1, X_2, \ldots, X_r$ be $r$ independent Bin$(n, \theta)$ distributed random variables and $\theta$ is known. Let
the prior distribution of $n$ be the discrete uniform distribution

\[ g(n) = \frac{1}{N}; \quad 1 \le n \le N, \]
where $N$ is a large preselected integer. Show that the posterior distribution for $n$ is proportional
to $(1-\theta)^{rn} \prod_{i=1}^{r} \dfrac{n!}{(n-x_i)!}$. Hence obtain the generalised maximum likelihood estimate of $n$.


in : fis the integer solution of Tox) <(n(—p))’ and Tatts) >{(nt+Dd -»r"] 
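A direct search for the posterior mode of $n$ can be sketched as follows, taking the posterior kernel to be $(1-\theta)^{rn}\prod_i n!/(n-x_i)!$ on $\max_i x_i \le n \le N$. The data, the value of $\theta$ and the bound $N$ are hypothetical, and the function names are illustrative.

```python
from math import lgamma, log

def log_posterior_kernel(n, xs, theta):
    # log of (1 - theta)^(r n) * prod n! / (n - x_i)!, valid for n >= max(xs).
    return len(xs) * n * log(1 - theta) + sum(
        lgamma(n + 1) - lgamma(n - x + 1) for x in xs)

def generalised_mle(xs, theta, N):
    # Posterior mode of n under the discrete uniform prior on 1..N.
    candidates = range(max(max(xs), 1), N + 1)
    return max(candidates, key=lambda n: log_posterior_kernel(n, xs, theta))

xs, theta, N = [3, 5, 4], 0.4, 200   # hypothetical data and settings
print(generalised_mle(xs, theta, N))
```

For these values the mode is $n = 10$, which is the point where the ratio of successive kernel values crosses one.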


i=l i=l 


Assume f(x|0)= os) and g(@)= aa’ Show that the generalised maximum 
likelihood estimate of 0 is x. 

Let $X \sim$ Pois$(\theta)$. If $g(\theta)$ is a proper prior distribution of $\theta$, obtain the linear Bayes estimate of $\theta$
under SELF. Hence obtain the linear Bayes estimator of $\theta$ when its distribution is Gamma$(\alpha, \beta)$.
Let $X$ be the number of successes in $n$ trials with probability $\theta$ of success in each trial. Let $x$
be an observation and suppose we are estimating $\theta$ by a linear function of $x$. Show that the Bayes linear
estimator is

\[ \delta(x) = a\,\frac{x}{n} + (1-a)E(\theta), \]

where

\[ a = \mathrm{Var}(\theta)\left[\mathrm{Var}(\theta) + \frac{1}{n}E\{\theta(1-\theta)\}\right]^{-1}. \]
Obtain the linear Bayes estimate of Poisson mean 8 under linex loss function and show that it 
is “(a + lo + au when the prior distribution @ is Gamma(a.,B). 


Construct a 95% HPD interval for the mean of a normal distribution when the precision is 1 and the
observed sample is 1.5, 1.3, 0.2, −2.1, −0.9. Compare it with the classical 95% confidence interval for
$\theta$.

Suppose $X$ is $N(\theta, 1)$ and that a 90% HPD credible set for $\theta$ is desired. The prior information is
that $\theta$ has a unimodal density with median 0 and quartiles ±1. The observation is $x = 6$.
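For the first of these questions the prior is not specified; the sketch below assumes a flat (non-informative) prior on the mean, so that the posterior is normal with mean $\bar x$ and variance $1/(n \times \text{precision})$ and the HPD interval is symmetric about $\bar x$. The function name is illustrative.

```python
from statistics import NormalDist, mean

def hpd_normal_mean(data, precision=1.0, level=0.95):
    # Assumes a flat prior: posterior is N(sample mean, 1 / (n * precision)).
    n = len(data)
    centre = mean(data)
    sd = (1 / (n * precision)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return centre - z * sd, centre + z * sd

sample = [1.5, 1.3, 0.2, -2.1, -0.9]
low, high = hpd_normal_mean(sample)
print(low, high)
```

Under this assumption the interval is roughly (−0.877, 0.877), which coincides numerically with the classical 95% confidence interval even though the interpretation differs.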

(i) If the prior information is modelled as a N(0, 2.19) distribution, find the 90% HPD credible 
interval. 

(ii) If the prior information is modelled as a Cauchy (0, 1) distribution, find the 90% HPD 
credible interval. 




6.61 


6.62 


6.63 


6.64 


6.65 
The weekly number $X$ of fires in a town has a Pois$(\theta)$ distribution. It is desired to find a 90%
HPD credible interval for $\theta$. Nothing is known a-priori for $\theta$, so the non-informative prior
$g(\theta) = (1/\theta)\, I_{(0,\infty)}(\theta)$ is deemed appropriate. The number of fires observed for five weekly periods
was 0, 1, 1, 0, 0. What is the desired HPD credible interval?

Derive the Bayes estimate of $\theta$ when $\theta\mid x \sim N(\mu(x), 1)$ and the loss function is the asymmetric
squared error loss

\[ L(\theta, \delta) = \begin{cases} \omega(\theta-\delta)^2 & \text{if } \delta < \theta \\ (1-\omega)(\theta-\delta)^2 & \text{otherwise.} \end{cases} \]


Consider the bounded loss L(0, 6) =1 eg Mg >0, for the estimation of 8 when X ~ N(O, 1). 


Determine the Bayes estimate associated with 

(9) < exp[-A|0-]]. 
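Consider the problem of estimating a binomial probability $\theta$ of a success on the basis of $m$
observed successes in $N$ independent trials. Suppose $g(\theta)$ is uniform over the interval (0, 1) and

\[ L(\theta, \delta) = \frac{|\theta-\delta|}{\theta(1-\theta)}. \]
Show that the Bayes estimate of $\theta$ is
\[ \hat\theta = \begin{cases} 0 & \text{if } m = 0 \\ 1 & \text{if } m = N \\ \theta_0 & \text{if } m \ne 0 \text{ or } N, \end{cases} \]

where $\theta_0$ is the median of the distribution
\[ g(\theta\mid m) = \frac{\Gamma(N)}{\Gamma(m)\Gamma(N-m)}\, \theta^{m-1}(1-\theta)^{N-m-1}. \]

(Basu and Thompson, 1996).

Let $X_1, X_2, \ldots, X_n$ be a realized random sample of lifetimes from a population having exponential
density $f(x\mid\theta)$ given by

\[ f(x\mid\theta) = \frac{1}{\theta}\, e^{-x/\theta} \quad (x > 0), \]
and let the expected lifetime $\theta$ have the prior density

\[ g(\theta) = \frac{\alpha^{\beta}}{\Gamma(\beta)}\, \theta^{-(\beta+1)} e^{-\alpha/\theta}, \quad \theta > 0. \]

Show that the Bayes estimate of the survival function
\[ S(x_0\mid\theta) = P(X > x_0\mid\theta) = e^{-x_0/\theta}, \quad x_0 > 0, \]
and its posterior variance are given by

\[ \hat S(x_0) = \left(1 + \frac{x_0}{\alpha+t}\right)^{-(\beta+n)}, \quad t = \sum_{i=1}^{n} x_i, \]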
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6.66 


6.67 


6.68 


and

\[ \mathrm{Var}(S(x_0\mid\theta)\mid x) = \left(1 + \frac{2x_0}{\alpha+t}\right)^{-(\beta+n)} - \left(1 + \frac{x_0}{\alpha+t}\right)^{-2(\beta+n)}. \]


Let X and Y be independent with bivariate binomial distribution 


f(y |P).P2) -(7 | SJorera-eyr-a—ear = Oslo YHOsl uM 0<(8,,05) <1. 


Suppose that 0, and 0, are a-priori independent of each other and having 
U(0, 1) distribution. 
(i) If the loss function is squared error of the form 
LO, —6,,8) = (0-8, =O) 3 
find the Bayes estimate 5 of (6 —9,). 


(ii) | Obtain the Bayes estimate b of ¢=96,/6, under the relative squared error loss 
function 62(@-$)’. 


(Thompson and Basu, 1996) 
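Consider the loss functions

(i) $L_1(\Delta) = \begin{cases} -\Delta & \text{if } \Delta < 0 \\ a\Delta & \text{if } \Delta \ge 0,\ a > 0 \end{cases}$

(ii) $L_2(\Delta) = \begin{cases} \Delta^2 & \text{if } \Delta < 0 \\ a\Delta^2 & \text{if } \Delta \ge 0,\ a > 0 \end{cases}$

(iii) $L_3(\Delta) = e^{a\Delta} + c\Delta^2 - a\Delta - 1;\ (a, c) > 0$

If $\Delta = (\delta - \theta)$, show that the Bayes estimate under $L_1$ is the $(1+a)^{-1}$th quantile of the posterior
distribution of $\theta$, whereas the Bayes estimate under $L_2$ is a number $\hat{\delta}$ which satisfies the equation

$$\hat{\delta} = \left\{\bar{\theta} + (a-1)\int_{-\infty}^{\hat{\delta}} \theta\, g(\theta \mid x)\, d\theta\right\}\left\{1 + (a-1)\int_{-\infty}^{\hat{\delta}} g(\theta \mid x)\, d\theta\right\}^{-1},$$

where $\bar{\theta}$ denotes the posterior mean. The Bayes estimate $\hat{\delta}$ under $L_3$ is given by

$$\hat{\delta} = \hat{\delta}_{\text{Linex}} + \frac{1}{a}\log\left\{1 + \frac{2c}{a}\big(\bar{\theta} - \hat{\delta}\big)\right\},$$

where $\hat{\delta}_{\text{Linex}}$ is the Bayes estimate under the linex loss function. (Note: Thompson and Basu (1996)
call $L_3$ the “squarex loss function”.)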

(Iliopoulos, 2003) 

Suppose $X \sim \text{Bin}(n, p)$, where $p \in (0, 1)$ is a known constant and $n \in \{1, 2, \ldots\}$ is an unknown
parameter. Let the prior distribution of n be negative binomial
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r+n-l 4 
g(n|r,8) = (-6)'0""; n=1,2,...5 r>0,0<O0<1. 
r 


Show that the Bayes estimate of n, under SELF, is 


eee ifx =0 
; 1—8q 
= -1)0q ( 0 eed 
Z ma ) a 4 aa if x21 
1-€6q 1-@q 1-@q |x+(r—-1)6q 


where $q = 1 - p$, whereas, under the loss $L(n, a) = (a - n)^2/n$, the Bayes estimate is


Gu~)eq__itx =o 
(1—8q)(1— — @q)) 
ea ep 
1-Oq_ 1-6q 


6.69 (Shapiro and Wardrop, 1980) 
Let the rv X has a pdf 
f(x|0) = exp(xo(8)-(8)) n(x) 
and 8 has natural conjugate prior density of the form 
(9) o exp((8)T, — y(8)N,). 
For the loss function, L(0,6) = (6) exp[a(@)u—y(6)v](0—6)?, where a’(®) is the Fisher’s 


information, show that the Bayes estimate of 0 is 6= (T, + =x; +u)/(N, +n+Vv) and the 


posterior expected loss is, 


i} exp[o.(0)(T, + 2x, +u)—y(0)(N, + n+ v)]d0 
(N, +ntv)f explo(6)(T, + Ex,)— (O(N, +n)]d0 


6.70 Suppose $X_1, X_2, \ldots, X_n$ is a random sample from $N(\mu, \sigma^2)$. If the population coefficient of variation
is known to be a constant k, then show that Jeffreys’ non-informative prior for $\mu$ is $g(\mu) \propto 1/\mu$.
Obtain the Bayes estimate of $\mu$ under (i) the squared error loss function and (ii) the relative squared error
loss function $\left(\dfrac{\hat{\mu} - \mu}{\mu}\right)^2$.


6.71 (Dey and Micheas, 2000) 
Consider the loss function 


Lo.) =e] e212) 
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6.72 


71 


72 


is 


74 


u 
r(r—1) 
(i) the form of the loss function L,(6, a), and 


(i) further if the prior for 0 is conjugate prior, then find Bayes estimate of 0 under L,(0, a). 
Consider a random sample of size n from the Bivariate normal population with mean vector 


If g(u) = ,0<r<l1 and f(x|0) is N(0, 1), obtain 


u= s and known variance covariance matrix 2 = , 5 1 


‘ 2 


} If the prior distribution of 


0 1 0 
is also Bivariate normal with mean vector 4 and variance covariance matrix r ‘ find the 


MELO estimate of 0 = A (Recall that the relevant loss function is (U, —,6)"). 


LM, 


Chapter 7 


(Robert, 2001) 

Consider $X \sim \text{Pois}(\theta)$. The hypothesis to test is $H_0: \theta \le 1$ versus $H_1: \theta > 1$. Give the posterior
probability of $H_0$ for x = 1 when $\theta \sim \text{Gamma}(\alpha, \beta)$.

(i) How does this probability get modified when $\alpha$ and $\beta$ go to 0? Does this answer depend
on the rates of convergence of $\alpha$ and $\beta$ to 0?

(ii) Compare with the probability associated with the non-informative distribution $g(\theta) = 1/\theta$.
Is it always possible to use this improper prior?
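The posterior probability asked for here can be explored numerically. The sketch below assumes that Gamma(α, β) is parametrized by shape α and rate β (the text does not state the convention at this point), so that the posterior after observing x is Gamma(α + x, β + 1). The function names are hypothetical and the integral is approximated by a midpoint rule.

```python
import math

def gamma_cdf_at_one(shape, rate, steps=20000):
    """P(theta <= 1) for theta ~ Gamma(shape, rate) with shape >= 1 (midpoint rule)."""
    log_norm = shape * math.log(rate) - math.lgamma(shape)
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += math.exp(log_norm + (shape - 1) * math.log(t) - rate * t)
    return total * h

def posterior_prob_h0(x, alpha, beta):
    """Posterior probability of H0: theta <= 1 (assumed shape/rate convention)."""
    return gamma_cdf_at_one(alpha + x, beta + 1.0)
```

Evaluating `posterior_prob_h0(1, alpha, beta)` along different sequences of small `alpha` and `beta` is one way to examine part (i) empirically.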

Consider $X \sim \text{Bin}(n, \theta)$, $H_0: \theta = 1/2$ and $H_1: \theta \neq 1/2$. The prior distribution $g(\theta)$ is a Beta($\alpha$, $\alpha$)
distribution. Determine the limiting posterior probability of $H_0$ when n = 10, x = 5 and n = 15,
x = 7 as $\alpha$ goes to $+\infty$. Are these values intuitively logical? Give the posterior probabilities for
Bayes-Laplace, Jeffreys, and Haldane non-informative priors.

Let $X_1, X_2, \ldots, X_n$ be a random sample from a Bernoulli distribution for which the value of the
parameter $\theta$ is unknown. Suppose the prior distribution of $\theta$ is such that
$P(\theta = 1/2) = p > 0$ and the remaining probability $(1 - p)$ is uniformly distributed over the interval
$0 < \theta < 1$. Find the posterior probability that $\theta = 1/2$ when $\sum_{i=1}^{n} x_i = t$, and hence show that this
posterior probability is greater than the prior probability p if, and only if,

$$\binom{n}{t}\left(\frac{1}{2}\right)^{n} > \frac{1}{n+1}.$$
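The inequality can be checked numerically. The sketch below uses the fact that the integral of $\theta^t(1-\theta)^{n-t}$ over (0, 1) equals $1/\{(n+1)\binom{n}{t}\}$; the function name is hypothetical.

```python
from math import comb

def posterior_prob_half(n, t, p):
    """Posterior probability that theta = 1/2 given t successes in n trials."""
    like_h0 = 0.5 ** n                       # theta^t (1 - theta)^(n - t) at theta = 1/2
    marg_h1 = 1.0 / ((n + 1) * comb(n, t))   # uniform-prior marginal of the same kernel
    return p * like_h0 / (p * like_h0 + (1 - p) * marg_h1)

def condition_holds(n, t):
    """The stated condition: C(n, t) (1/2)^n > 1 / (n + 1)."""
    return comb(n, t) * 0.5 ** n > 1.0 / (n + 1)
```

With n = 10, a balanced outcome t = 5 raises the probability above p, while an extreme outcome such as t = 0 lowers it.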
(Leonard and Hsu, 1999) 


Let $X \sim \text{Bin}(n, \theta)$. Given X = x, we wish to investigate $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$, where $\theta_0$ is
specified in advance. Suppose we employ a prior distribution as follows:

(i) $P(H_0) = p$, $P(H_1) = 1 - p$

(ii) Given $H_1$, $\theta$ is beta distributed with parameters $\alpha$, $\beta$.

Based upon this prior assessment, show that $P(H_0 \mid x) = Rp/(pR + 1 - p)$, and fully describe the
posterior odds ratio R in favour of $H_0$.

Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\theta, \sigma^2)$. Under model $M_1$, $\sigma^2 = 1$ and under model
$M_2$, $\sigma^2 = 2$. In either model the prior distribution of $\theta$ is $N(\mu, \tau)$. Show that as $\tau \to \infty$, the Bayes
factor for model 1 against model 2 tends to

$$2^{(n-1)/2} \exp\left\{-\frac{1}{4}\sum_{i=1}^{n}(x_i - \bar{x})^2\right\}.$$

Confirm that this tends with probability 1 to infinity if $\sigma^2 = 1$ and tends to zero if
$\sigma^2 = 2$.
Let $X \sim \text{Bin}(n, \theta)$ and $0 < x < n$. Under $M_1$: $X \sim \text{Bin}(n, 1/2)$; under $M_2$: $g_2(\theta) = 1$, $\theta \in [0, 1]$. Show
that

$$B_{12} = (n+1)\binom{n}{x} 2^{-n}.$$
A single observation X has the Pois($\theta$) distribution. Under model $M_1$, $\theta = 1$ and under model
$M_2$, $\theta$ has a Gamma(2, 4) density. Prove that the Bayes factor for $M_1$ against $M_2$ is maximised
when x = 4. Generalise the result for the Gamma($\alpha$, $\beta$) prior for $\theta$. In the case when the prior mode
equals one, show that the Bayes factor becomes

$$e^{-1}(\beta + 1)^{x+\beta+1}\, \beta^{-(\beta+1)}\, \Gamma(\beta + 1)/\Gamma(\beta + x + 1)$$

and tends to infinity as $\beta \to 0$.
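The behaviour of this Bayes factor in x can be tabulated directly. The sketch below assumes the Gamma(α, β) prior is parametrized by shape α and rate β, which is the convention under which the mode-one formula above holds; the function name is hypothetical.

```python
import math

def bayes_factor_m1_vs_m2(x, alpha=2.0, beta=4.0):
    """Bayes factor of M1 (theta = 1) against M2 (theta ~ Gamma(alpha, rate beta))."""
    # log marginal under M1: Poisson(1) mass at x
    log_m1 = -1.0 - math.lgamma(x + 1)
    # log marginal under M2: Poisson-gamma mixture (negative binomial) mass at x
    log_m2 = (alpha * math.log(beta) + math.lgamma(alpha + x) - math.lgamma(alpha)
              - math.lgamma(x + 1) - (alpha + x) * math.log(beta + 1.0))
    return math.exp(log_m1 - log_m2)

values = [bayes_factor_m1_vs_m2(x) for x in range(12)]
```

Under this parametrization the factor increases up to x = 3, takes the same value at x = 3 and x = 4, and decreases afterwards, so x = 4 attains the maximum jointly with x = 3.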
(Berger, 1985) 
Theory predicts that the melting point $\theta$ of a particular substance under a pressure of $10^6$
atmospheres is 4.01. The procedure for measuring $\theta$ is fairly inaccurate due to the high pressure.
It is known that an observation X has a $N(\theta, 1)$ distribution. Five independent experiments give
observations of 4.9, 5.6, 5.1, 4.6 and 3.6. The prior probability that $\theta = 4.01$ is 0.5. The remaining
values of $\theta$ are given the density $0.5\, g_1(\theta)$, where $g_1$ is a N(4.01, 1) density. Formulate and
conduct a Bayesian test of the proposed theory.
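One way to set up the computation is through the sample mean, which is sufficient here. The sketch below is an illustration rather than a full solution: it treats the second argument of N(4.01, 1) as a variance, and the variable names are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Normal density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

obs = [4.9, 5.6, 5.1, 4.6, 3.6]
n = len(obs)
xbar = sum(obs) / n
theta0, prior_h0 = 4.01, 0.5

f0 = normal_pdf(xbar, theta0, 1.0 / n)        # density of xbar under H0
m1 = normal_pdf(xbar, theta0, 1.0 + 1.0 / n)  # marginal of xbar under H1
bf01 = f0 / m1                                # Bayes factor in favour of H0
post_h0 = prior_h0 * f0 / (prior_h0 * f0 + (1 - prior_h0) * m1)
```

The Bayes factor and the posterior probability of the theory can then be read off `bf01` and `post_h0`.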
Suppose X,, X,, ..., X, is a random sample from N(0, r) where r is the known precision. We wish 
to test the hypotheses H,: 0 = 0, against H,: 04 0,. Use Jeffreys approach by assuming 
P(H,) = p > 0 and the conditional prior distribution of 8, when 0 # 0,, is N(u, Tt) defined over 
the space O-{0,}. 
(i) Show that the Bayes factor in favour of H, is 


ea o( | s ew -e-0y"|) 
T+n 2\|t+n 


Hence show that the Bayes factor tends to zero as $|\bar{x}| \to \infty$. For what value of $\bar{x}$ is the Bayes
factor maximum, and what is the maximum value of the Bayes factor? Is it true that, for any
fixed number of observations n, no value of the observations can increase $P(H_0 \mid x)$ by more
than a limited amount? Is it also true that the posterior probability of $H_0$ will increase when the
prior distribution under $H_1$ has mean $\mu = \theta_0$?




7.10 


7.11 


7.12 


7.13 


Suppose $X_1, X_2, \ldots, X_n$ is a random sample from $N(\theta, r)$ where the mean $\theta$ and precision r are
both unknown. We wish to test the hypothesis $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$. Use Jeffreys’
approach under the following assumptions:

(i) $P(\theta = \theta_0) = p > 0$

(ii) The conditional pdf of r, when $\theta = \theta_0$, is Gamma($\alpha$, $\beta$); $\alpha > 0$, $\beta > 0$.

(iii) The conditional joint pdf of $\theta$ and r under $H_1$ is normal-gamma such that
$g_1(\theta, r) = g_1(\theta \mid r)\, g_2(r)$, where $g_1(\theta \mid r)$ is $N(\mu, \tau r)$ and $g_2(r)$ is Gamma($\alpha$, $\beta$). Show that the
posterior odds in favour of $H_0$ is


p (t4+n 1/2 28+ YC, B+ (amie RW) ? 
onl T 2B +X(x, - x)’ +n(x-6,)° 


Show that the posterior odds depend only on the statistic $n(\bar{x} - \theta_0)^2 \big/ \sum_{i=1}^{n}(x_i - \bar{x})^2$ when
$\mu = \theta_0$ and $\beta \to 0$.
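Let $X \sim N(\theta, 1)$ and $\theta \sim N(0, 1)$. If the loss is specified by

$$L(\theta, a_0) = \begin{cases} 0 & \text{if } \theta \ge 0 \\ 1 & \text{if } \theta < 0, \end{cases} \qquad L(\theta, a_1) = \begin{cases} 0 & \text{if } \theta < 0 \\ 1 & \text{if } \theta \ge 0, \end{cases}$$

where $a_0$ = action of accepting $H_0: \theta \ge 0$ and $a_1$ = action of accepting $H_1: \theta < 0$. If
x = 1 is observed, perform the Bayes test for testing $H_0$ and $H_1$.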
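Under this 0-1 loss the Bayes test accepts the hypothesis with the larger posterior probability. A minimal sketch, using the standard conjugate result that the posterior of $\theta$ given x is N(x/2, 1/2) when both variances are one (the function names are hypothetical):

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def posterior_prob_nonneg(x):
    """P(theta >= 0 | x) when X | theta ~ N(theta, 1) and theta ~ N(0, 1)."""
    post_mean, post_var = x / 2.0, 0.5
    return std_normal_cdf(post_mean / math.sqrt(post_var))

def bayes_test(x):
    """Return the accepted hypothesis under the 0-1 loss described above."""
    return "H0" if posterior_prob_nonneg(x) >= 0.5 else "H1"
```

Calling `bayes_test(1.0)` applies the rule to the observation in the problem.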

Let X ~ N(O, 1) and 6 ~ N(O, 1). Obtain the critical region C for testing H, : |0| < 1 against 
H,: |0| = 1 when the loss function is 


£(@,a,)-{0 F920 
PO oe: APSO. 
0 if <0 
L(0,a,) = 
8 if 00, 


where action a, is “accept H,”, i= 0, 1. 

A company periodically samples products coming off a production line. In order to make sure
the production process is running smoothly, the investigator chooses a sample of size 5 and
observes the number of defectives. Past records show that the proportion of defectives, $\theta$, varies
according to the Beta(1, 9) distribution. The loss in letting the production process run is $10\theta$,
while the loss in stopping the production line, recalibrating, and starting up again is 1. What is
the Bayes decision if one defective is observed in a sample?
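The decision compares two posterior expected losses. The sketch below assumes the loss for letting the process run is ten times the defective proportion, as read above, and uses the standard Beta-binomial update; the function name and its return format are hypothetical.

```python
def bayes_decision(defectives, sample_size, a=1.0, b=9.0,
                   run_cost_per_theta=10.0, stop_cost=1.0):
    """Compare posterior expected losses of running versus stopping the line."""
    post_a = a + defectives                   # Beta posterior, first parameter
    post_b = b + sample_size - defectives     # Beta posterior, second parameter
    post_mean = post_a / (post_a + post_b)    # E(theta | data)
    run_loss = run_cost_per_theta * post_mean
    decision = "stop" if run_loss > stop_cost else "run"
    return decision, run_loss
```

For one defective in five items the posterior is Beta(2, 13), so the expected loss of running is 10 × 2/15.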
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84 
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Chapter 8 


Suppose a random sample $X_1, X_2, \ldots, X_n$ is drawn from a Gamma(2, $\theta$) population. If the prior
distribution of $\theta$ is Gamma(a, b), derive the predictive pdf of the future observation $X_{n+1}$ and
obtain the predictive mean and variance.


Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from $N(\theta, r)$, precision r known. If $X_{n+1}$ is the future
independent observation from the $N(\theta, r)$ distribution and the prior density of $\theta$ is $N(\mu, \tau)$, derive
the predictive density of $X_{n+1}$, given x.


Let $X_1, X_2, \ldots, X_n$ be iid observations from $N(\theta, \sigma^2)$ and let $Y_1, Y_2, \ldots, Y_m$ be a future independent
sample from the same distribution. If the joint prior distribution of $(\theta, \sigma^2)$ is $g(\theta, \sigma^2) \propto 1/\sigma^2$, obtain
the predictive distribution of the future sample mean $\bar{Y}$, given the sample x. Construct the
$100(1 - \alpha)\%$ highest predictive density interval for the future sample mean $\bar{Y}$.
Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from $N(\theta, 1)$. Assume the prior distribution of $\theta$
is $g(\theta) = a\, g_1(\theta) + (1 - a)\, g_2(\theta)$, where $g_1(\theta)$ and $g_2(\theta)$ are normal distributions with means 1 and
$-1$, respectively, but common variance unity. Obtain the predictive pdf of the future observation
$X_{n+1}$, given x.


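Suppose $X_1, X_2, \ldots, X_n, X_{n+1}$ are iid $U(-\theta, \theta)$ random variables with $\theta > 0$. The prior for $\theta$ has a
Pareto(a, b) distribution. Show that the predictive distribution of $X_{n+1}$, given $(x_1, x_2, \ldots, x_n)$, has density

$$g(x_{n+1} \mid x) = \begin{cases} \dfrac{B}{2A(B+1)} & \text{if } |x_{n+1}| \le A \\[2ex] \dfrac{B A^{B}}{2(B+1)\,|x_{n+1}|^{B+1}} & \text{otherwise,} \end{cases}$$

where $A = \max(a, |x_1|, |x_2|, \ldots, |x_n|)$ and $B = b + n$.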


Let $g(y \mid x)$ be the unimodal predictive density function. For an arbitrarily chosen $\varepsilon > 0$, suppose
that the utility function $U(a, y)$ is 1 for $|y - a| < \varepsilon$ and 0 otherwise. Show that the optimum point
prediction is the mode of $g(y \mid x)$.
Independent measurements $X_1, X_2, \ldots, X_n, X_{n+1}$ of the log concentration of a constituent A of a
pharmaceutical product constitute a random sample from a density

$$f(x \mid \theta) = \theta \exp(x - \theta e^{x}); \quad \theta > 0.$$

Suppose the observed values of the first n random variables $X_i$ (i = 1, 2, ..., n) are $x_1, x_2, \ldots, x_n$. If
the prior pdf of the parameter $\theta$ is Gamma($\alpha$, $\beta$), find the predictive density of $X_{n+1}$, given
$x_1, x_2, \ldots, x_n$. Obtain the optimal forecast $\hat{x}_{n+1}$ of the future observation $X_{n+1}$, based on $x_1, x_2, \ldots, x_n$, such
that $|\hat{x}_{n+1} - x_{n+1}| < 1$.

Hint: Use the loss function

$$L(y, \hat{y}) = \begin{cases} 0 & \text{if } |\hat{y} - y| < 1 \\ 1 & \text{otherwise.} \end{cases}$$
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8.8 


8.9 


8.10 


8.11 


8.12 


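Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from $N(\theta, \sigma^2)$, $\sigma^2$ known. For the U(-1, 1) prior
density for the parameter $\theta$, derive the prior predictive density $g(\bar{x})$ of the sample mean $\bar{x}$. Show
that

$$\frac{d}{d\bar{x}} \log g(\bar{x}) = \frac{n}{\sigma^2}\big(E(\theta \mid \bar{x}) - \bar{x}\big).$$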

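Let $X \sim \text{Pois}(\theta)$ and $g(x)$ be an arbitrary prior predictive density of x. Show that

$$E(\log\theta \mid X) = \Psi(x + 1) + \frac{\partial}{\partial x}\log g(x),$$

and

$$\mathrm{Var}(\log\theta \mid X) = \Psi'(x + 1) + \frac{\partial^2}{\partial x^2}\log g(x),$$

where $\Psi(z)$ is the digamma function defined as $\Psi(z) = \dfrac{d}{dz}\log\Gamma(z)$, and $\Psi'(z)$ is the trigamma
function defined as $\Psi'(z) = \dfrac{d^2}{dz^2}\log\Gamma(z)$.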


Let X ~ Bin(1, 8), then 


(of 
fof 


Let X represent the life of a component with pdf $f(x \mid \theta)$. If $R(t) = P(X \ge t)$, show that, under SELF,
the Bayes estimate of R(t) is


ox 


; = ! tog g(x), 


and 


o 
‘} Awe = log g(x). 


$$\hat{R}(t) = \int_{t}^{\infty} g(y \mid x)\, dy,$$


where g(y|x) is the predictive pdf of the future independent observation from the same population, 
given x. 


If Fe1=5e0(- a} x,0>0 and Ones find R(t). 


(Aitchison and Dunsmore, 1975) 

Consider the problem of machine tool replacement. Suppose the lifetime of a machine tool is
distributed like $f(x \mid \theta) = \theta e^{-\theta x}$; $x, \theta > 0$. Let c be the cost per minute of a machine tool left unattended
(but worn out) until such time as it is inspected (and immediately replaced), d per minute be the
cost of attendance by an inspector, and the overhead cost of replacing the machine tool before
it wears out be A. Construct the loss functions and obtain the optimum time when the inspector
should be sent for the following replacement policies:

(i) Send in the inspector at the time ‘a’ to replace the machine tool immediately if it has 
already worn out, otherwise to attend the tool until it wears out. 

(ii) | The inspector attends the tool from its start until it wears out or for time ‘a’ whichever 
is shorter. If the tool has not worn out by time a, replace it. 

(iii) | Send in the inspector at time ‘a,’ and have him replace the machine tool as soon as it 
wears out or at time a,, whichever is earlier. 

[Hint: the predictive distribution of the lifetime Y of the future machine tool, given the lifetimes
of n earlier machine tools, with Gamma(a, b) as the prior for $\theta$, is

$$g(y \mid x) = (a + n)\left(b + \sum_{i=1}^{n} x_i\right)^{a+n}\left(b + \sum_{i=1}^{n} x_i + y\right)^{-a-n-1}.]$$


In order to obtain favourable terms a theatre ticket agency has, at the end of the first week of
a new show, to give a firm commitment to take a fixed number of theatre seats for each daily
performance. The agency reckons that the daily demand of its clientele is Poisson distributed
but it is very vague about the mean parameter. The numbers demanded at the first six performances are 8, 6,
3, 7, 2, 5. For each ticket sold the agency makes a profit of 50 paise; for each ticket unsold, a
loss of Re. 1. What fixed number of seats per day should the agency order, and what is its
expected profit with this number?
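A numerical sketch of one possible formalization follows. It makes an assumption that the problem does not fix: vagueness about the Poisson mean is represented by the improper prior $g(\theta) \propto 1/\theta$, so that the posterior is Gamma(Σx, n) in the shape/rate convention and the predictive demand is negative binomial. The function names are hypothetical and amounts are in rupees.

```python
import math

def predictive_pmf(y, total, n):
    """Negative binomial predictive P(Y = y) when theta | data ~ Gamma(total, rate n)."""
    log_p = (math.lgamma(total + y) - math.lgamma(total) - math.lgamma(y + 1)
             + total * math.log(n / (n + 1.0)) - y * math.log(n + 1.0))
    return math.exp(log_p)

def expected_profit(k, total, n, gain=0.5, loss=1.0, y_max=200):
    """Predictive expected profit when k seats are ordered each day."""
    profit = 0.0
    for y in range(y_max + 1):            # truncate the negligible upper tail
        sold = min(y, k)
        profit += predictive_pmf(y, total, n) * (gain * sold - loss * (k - sold))
    return profit

demands = [8, 6, 3, 7, 2, 5]
total, n = sum(demands), len(demands)
best_k = max(range(0, 31), key=lambda k: expected_profit(k, total, n))
best_profit = expected_profit(best_k, total, n)
```

A different choice of vague prior would change the predictive distribution and could change `best_k`.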

(Aggarwal, 2006) 

Consider the Binomial superpopulation model of Example 8.21. Show that the Bayes prediction 
of the population mean ‘w’ under Zellner’s BLF 


, OO A a 
Lu.) =— (X; -f)’ +(1-@)(u-f)’,0<s <1, 


i=l 


with Bayes prediction risk 


(a) = oD 4 4) ‘ati et A 1 ( oa+b)+n ab 
aoa n Nj} |N-n nl a+b+n (a+b)(at+b+l1) | 


Consider the Binomial superpopulation model of Example 8.21. Find the Bayes predictor of 
population total with respect to the linex loss. 
(Aggarwal, 2006) 
Consider X,, X,, ..., X,, a random sample from a superpopulation having pdf belonging to a 
regular one-parameter exponential family 
f(x | 0) = h(x)exp[0x-W)], 
(i) For the natural conjugate prior for 0, 
(9) = b(v,,v,)exp[v,8-v,w(9)], 
show that the Bayes predictor for the population mean ‘wy’ is 
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9.1 


92 


~ no N-n]} _ v, +x, 
f= —x,+ ox, +(1-@) 
N N v,tn |’ 


when the loss function is BLF. 
(ii) | Further show that the Bayes prediction risk is 


@) -|[-Pea-of 1-8) i= —+ [eo oo |f por 


2 2 2 
si” | Gey) ==) Bl wi@-—4) 
N v,+n Vv, 
In particular, if the sampled population belongs to NEF-QVF, i.e., variance is the quadratic 


function of the mean, 


w"(8) =a, +a,w’(0)+a,(W(8))’, 
where a,, a,, and a, are constants and are not all zero, show that the Bayes prediction risk 
simplifies to 


=| Peco 1-2) | I +1( eta | metazoan een) 


N-n n| v,+n (v,—-a,) 


(iii) | Show that N(0, 07), o? known is a member of NEF-QVF. Hence obtain the Bayes predictor 
of the population mean and the corresponding Bayes prediction risk. 


Chapter 9 


Consider a consumption function $y_t = \beta x_t + u_t$, t = 1, 2, ..., 15, where $x_t$ denotes permanent income
in the t-th period, $y_t$ denotes permanent consumption in the t-th period, and $u_t$ are iid N(0, 2.25)
random variables. Suppose it is given that $\sum_{t=1}^{15} x_t y_t = 1248.28$ and $\sum_{t=1}^{15} x_t^2 = 1392$.

(i) Find the 95% HPD interval for $\beta$, when the prior pdf for $\beta$ is $N(0.85, 0.06^2)$.
(ii) Suppose that $\mathrm{Var}(u_t)$ is $\sigma^2$ (unknown). If the joint prior distribution of $(\beta, \sigma)$ is such that
$g(\beta \mid \sigma)$ is $N(0.85, \sigma^2/225)$ and $g(\sigma) \propto \sigma^{-1}$, find the 95% HPD interval for $\beta$. Find the Bayes
factor in favour of $H_0: \beta \le 1$ against the alternative $H_1: \beta > 1$.

(iii) Obtain the predictive pdf for $y_{16}$, given that $x_{16} = 15$. Hence obtain the Bayes estimate of $y_{16}$
when the loss is SELF.
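Part (i) reduces to a conjugate normal update, sketched below. The sketch takes the two given sums to be $\sum x_t y_t = 1248.28$ and $\sum x_t^2 = 1392$, which is an assumption about the notation, and treats the second arguments of the normal distributions as variances. Variable names are hypothetical.

```python
import math

prior_mean, prior_sd = 0.85, 0.06
error_var = 2.25
sum_xy, sum_xx = 1248.28, 1392.0      # assumed meaning of the two given sums

prior_prec = 1.0 / prior_sd ** 2      # precision of the prior on beta
data_prec = sum_xx / error_var        # precision contributed by the sample
post_prec = prior_prec + data_prec
post_mean = (prior_prec * prior_mean + sum_xy / error_var) / post_prec
post_sd = math.sqrt(1.0 / post_prec)

z = 1.959964                          # 97.5% point of the standard normal
hpd = (post_mean - z * post_sd, post_mean + z * post_sd)
```

Because the posterior is normal, hence symmetric and unimodal, the 95% HPD interval coincides with the equal-tailed interval computed here.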


Consider the model y = x,B,+ x,8,+ u, where u~ MVN(O, J). 
1\/2 0 
(i) Find the posterior distribution of § when the prior for B is ovn((i}{c al and the 


2 


sample information is n=5, X’X = : fe 


1 
} xX’Y )) and Y’Y = 40. 
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(ii) | What will be the marginal posterior density for B,? 
(iii) | Assuming prior odds of unity, find the Bayes factor of H,: B,= 0 against H,: B, #0. 
Under the conditions in case (2) of Section 9.1, show the predictive distribution of future 


independent observation Y ,,, given x,,, and z, is a 3-parameter t-density with (n—1) df, location 


parameter Bx 


n+l 


— x2 n 
and variance ores xia |S i and hence construct 95% 


predictive interval of y, 
Under the assumptions of heteroscedastic model (9.12), show that the MELO estimate of 1/B is 


$4) boses8f S808) [82] 


for w. = I/x?. 
Consider the Poisson regression model of Example 9.6 with the Balanced loss function defined 
as 


Wr yh=— 23 =T) #0 =oT=1),0<0<1. 


i=l 
Show that the Bayes predictor T of T is 


T = ony, +(1-)(u + ny, (N—n)x,/(v+nx, ). 
Consider the problem of estimation of the regression coefficient of a inverse Gaussian regression 
model 

Y, ~ Inverse gaussian (Bx,, kx,”) 
when the coefficient of variation is assumed to be a known constant. Obtain the Bayes estimate
of $\beta$ under SELF using the conjugate prior for $\beta$.
(Bolfarine and Zacks, 1991) 
Consider simple regression model y, = Bx, + u, where t = 1, 2, ..., n and 
u, ~ N(0, 6’), 6? known. If the prior distribution of B is N(b, 6,2). Show that Bayes predictor of 


N A 
the population total T = yi , under squared error loss function, is dy +B>) X; where B 


i=l ies ies 


is the Bayes estimate of 6 under squared error loss function, and the prediction risk is 
E(T- T) =o (N-n)x 
where 


N 
x=— : iy (N-n)x, = x Xx, 
N i=l i=n+l 
Discuss the case when the prior for B is non-informative. Obtain the Bayes estimate of 
population total T, when o? is also unknown and the joint prior of B and 07 is g(B, 07) e 1/07. 
Also show that the Bayes prediction risk is equal to 
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E(T-T)’ _ om i=l a ; 
Xj 
i=l 
where 
a 1 | (y,-Bx;) 
2 i i 
oO. = r 
: 3 | : 


9.8 (Bansal and Aggarwal, 2007) 
Consider the heteroscedastic regression model, in which the variance of the dependent variable 


is square of its expectation, so that, y, ~ N(Bx,,B’x’), i=1,2,..,.N, Be (9,09). 
(i) Construct the conjugate prior for B. 


N 
(ii) | Obtain the Bayes predictor of the population mean p=) Y,/N, under BLF 
i=l 


Lif) =" (y, -)* +a), 0S @<1, 


and hence obtain the minimal Bayes predictive expected loss. 
9.9 Suppose that, for the conditions in Section 9.7, the variance $\sigma^2$ is unknown. Show that the
predictive expected loss is minimized for


; f Var(6 | y) % 
= 1+ 5 
E®ly)| Ely) 


Chapter 10 


10.1 Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from $N(\theta, \sigma^2)$, variance $\sigma^2$ unknown. Obtain the
approximate posterior distribution of (i) $\sigma^2$, (ii) $\sigma$ and (iii) $r = 1/\sigma^2$. Give the conditions under which
the approximation is true.

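10.2 Let $X_1, X_2, \ldots, X_n$ be iid observations from the $N(\theta, \sigma^2)$ distribution. Assume that the joint prior
density $g(\theta, \log\sigma) \propto 1$. Show that the normal approximation to the posterior distribution of $(\theta, \log\sigma)$,
where $\sigma > 0$, is

$$N_2\left(\begin{pmatrix}\bar{x} \\ \log\hat{\sigma}\end{pmatrix}, \begin{pmatrix}\hat{\sigma}^2/n & 0 \\ 0 & 1/(2n)\end{pmatrix}\right), \quad \text{where } \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$

Hence show that the normal approximation of the marginal distribution of $\sigma^2$ is

$$N\left(\hat{\sigma}^2, \frac{2\hat{\sigma}^4}{n}\right).$$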
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(Bernardo and Smith, 1994) 
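Suppose $X_1, \ldots, X_n$ is a random sample from $N(\theta_1, 1)$ and another independent random sample
$Y_1, \ldots, Y_n$ is from $N(\theta_2, 1)$. Show that the joint posterior distribution for $\theta = (\theta_1, \theta_2)$ is approximately
distributed like

$$N_2\left(\begin{pmatrix}\bar{x}_n \\ \bar{y}_n\end{pmatrix}, \frac{1}{n}\begin{pmatrix}1 & 0 \\ 0 & 1\end{pmatrix}\right).$$

Hence show that the asymptotic posterior distribution of $\phi = \theta_1/\theta_2$, $\theta_2 \neq 0$, is

$$N\left(\frac{\bar{x}_n}{\bar{y}_n},\ n^{-1}\bar{y}_n^{-2}\left(1 + \frac{\bar{x}_n^2}{\bar{y}_n^2}\right)\right).$$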


Let $X \sim N(\theta, r)$, and the joint prior distribution be $g(\theta, r) \propto 1/r$, $r > 0$. Construct a 95% HPD credible
interval for $\theta$ based on a sample of size n. In what respects is it different from a 95% confidence
interval for $\theta$?

Use Lindley’s approximation to obtain the Bayes estimate of the normal mean with known variance
under the squared error loss function.

Suppose $X_1, X_2, \ldots, X_n$ are iid observations from $N(\theta, \sigma^2)$ with known coefficient of variation
$c = \sigma/\theta$, so that the underlying distribution is $N(\theta, c^2\theta^2)$.

(i) Obtain the normal approximation to the posterior distribution of $\theta$.

(ii) Use Lindley’s approximation method to find $E(\theta \mid x)$ under Jeffreys’ prior for $\theta$.


Suppose X,, X,, ..., X, be n independent Bernoulli trials with probability of success 0.If the prior 


distribution of 0 is Beta(2,2), obtain rte 8 : using Lindley’s approximation. 


Suppose X,, X,, .... X, be a random sample from Pois(0). Obtain Lindley’s approximation of 
E(log 8|x) when 

(i) the prior distribution of 8 is Lognormal (1, 6”), and 

(ii) the prior distribution of 8 is Gamma (1, 1). 

Suppose X,, X,, ..., X, be a random sample from f(x|0) = 8e*®. Assuming prior distribution of 0 


to be Gamma(a, 8), obtain E(e™ | x) using Lindley’s approximation. 


Suppose X,, X,, .... X, is a random sample from the inverse Gaussian distribution having pdf 


1/2 = ya 
exp] MOI | x20.2>0,4>0 
2u°x 


F(x | Ayu) = a 


Sas 6 a{ 1 ol 
(i) Show that the mle of 1, and A are W=x,andA=n S53) 
i=l rt 


(i) ‘If g(A, w) e I/A. Show that Lindley’s approximation of posterior means E(u|x) and E(A|x) 
are 


eg 1 
E =ynw+— +O] — |, 
(u| x)= [=] 
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10.12 
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and 


E(A| x)= a y ya of - , respectively. 


Suppose X,, X,, ..., X, be iid Bernoulli (1, 0), 8,<0<6,, where 0, and 0, are known constants in 
(0, 1). Assume that the parameter 0 has truncated Beta(a, b) with density 


1 
g(0) = Bape (8, - ) ae /(0, -0,)°"°,0e [8,,9,]. Obtain the Lindleys approximation 


of posterior mean E(6| x) . 


Suppose $X_1, X_2, \ldots, X_n$ are iid Pois($\theta$), $\theta_1 \le \theta \le \theta_2$, where $\theta_1$ and $\theta_2$ are known constants in $(0, \infty)$.
Assume that the parameter $\theta$ has a Gamma($\alpha$, $\beta$) prior distribution truncated on $[\theta_1, \theta_2]$. Obtain
Lindley’s approximation of the posterior mean $E(\theta \mid x)$.
Suppose $X_1, X_2, \ldots, X_n$ are iid $N(\theta, 1)$, $\theta_1 \le \theta \le \theta_2$. Assuming a truncated normal prior $N(\mu, \tau^2)$ over
$(\theta_1, \theta_2)$, obtain Lindley’s approximation of the posterior mean $E(\theta \mid x)$.


Derive the Tierney-Kadane approximation of $E(\theta \mid x)$, where $\theta$ is the unknown mean of the normal
distribution having variance unity and the prior distribution of $\theta$ is N(0, 1).

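Suppose $X_1, X_2, \ldots, X_n$ is a random sample from Poisson with unknown parameter $\theta$ and the prior
distribution of $\theta$ is Gamma($\alpha$, $\beta$). Use the Tierney-Kadane approximation to show that

$$E(\theta \mid x) = \frac{\alpha + \sum x_i}{(\beta + n)\,e}\left(\frac{\alpha + \sum x_i}{\alpha + \sum x_i - 1}\right)^{\alpha + \sum x_i - \frac{1}{2}} + O\!\left(\frac{1}{n^2}\right),$$

provided $\alpha + \sum_{i=1}^{n} x_i > 1$, and compare it with that obtained by Lindley’s approximation.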
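The quality of the Tierney-Kadane approximation in this conjugate case can be checked against the exact posterior mean. The sketch below assumes the shape/rate convention for Gamma(α, β), so the posterior is Gamma(α + Σx, β + n) with exact mean (α + Σx)/(β + n); the function names and the sample data are hypothetical.

```python
import math

def tk_posterior_mean(alpha, beta, data):
    """Tierney-Kadane approximation: (a / (b e)) * (a / (a - 1)) ** (a - 1/2)."""
    a = alpha + sum(data)
    b = beta + len(data)
    if a <= 1:
        raise ValueError("requires alpha + sum(data) > 1")
    return (a / (b * math.e)) * (a / (a - 1.0)) ** (a - 0.5)

def exact_posterior_mean(alpha, beta, data):
    """Exact mean of the Gamma(alpha + sum(data), beta + n) posterior."""
    return (alpha + sum(data)) / (beta + len(data))

sample = [3, 1, 4, 2, 0]              # illustrative counts, not from the text
approx = tk_posterior_mean(2.0, 1.0, sample)
exact = exact_posterior_mean(2.0, 1.0, sample)
```

Even for this small sample the two values agree closely, which is consistent with the second-order accuracy of the method.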


Chapter 11 


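Consider the sequence $S_n = (X_1, X_2, \ldots, X_n)$ of independent Poisson variables such that

$$X_i \sim \begin{cases} f(x \mid \theta_1) & i = 1, 2, \ldots, m \\ f(x \mid \theta_2) & i = m+1, \ldots, n \end{cases}$$

where $1 \le m \le n-1$, $f(x \mid \theta) = \theta^{x} e^{-\theta}/x!$, $x = 0, 1, 2, \ldots$; $\theta > 0$.
Suppose that the prior density of $(\theta_1, \theta_2)$ is

$$g(\theta_1, \theta_2) \propto \theta_1^{\alpha_1 - 1} e^{-\beta_1\theta_1}\, \theta_2^{\alpha_2 - 1} e^{-\beta_2\theta_2}; \quad \alpha_1, \alpha_2, \beta_1, \beta_2 > 0.$$

(i) Find the marginal posterior mass function of m.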

(ii) Find the marginal posterior density of 0, and 8,. 
(iii) | Find the marginal posterior density of 0,,. 
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We are assuming that 0,, 0,, and m are a-priori independent and g(m) is a discrete uniform over 
m= 1, 2,....n-1. 
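For numerical experimentation, the marginal posterior of the change point can be evaluated by integrating the two Poisson means out analytically. The sketch below assumes independent gamma priors in the shape/rate convention and a uniform prior on m, in which case the posterior mass at m is proportional to $\Gamma(\alpha_1 + S_m)(\beta_1 + m)^{-(\alpha_1 + S_m)}\,\Gamma(\alpha_2 + S_n - S_m)(\beta_2 + n - m)^{-(\alpha_2 + S_n - S_m)}$ with $S_m = x_1 + \cdots + x_m$. The function name and the data are hypothetical.

```python
import math

def changepoint_posterior(x, a1=1.0, b1=1.0, a2=1.0, b2=1.0):
    """Posterior mass of the change point m = 1, ..., n - 1 (list index 0 is m = 1)."""
    n, total = len(x), sum(x)
    log_w, s = [], 0
    for m in range(1, n):
        s += x[m - 1]                                   # S_m, sum of the first m counts
        log_w.append(math.lgamma(a1 + s) - (a1 + s) * math.log(b1 + m)
                     + math.lgamma(a2 + total - s)
                     - (a2 + total - s) * math.log(b2 + n - m))
    peak = max(log_w)                                   # stabilise before exponentiating
    w = [math.exp(v - peak) for v in log_w]
    z = sum(w)
    return [v / z for v in w]

counts = [2, 1, 3, 2, 2, 15, 18, 20, 17, 16]            # illustrative data with a shift
post = changepoint_posterior(counts)
m_hat = 1 + max(range(len(post)), key=post.__getitem__)  # posterior mode of m
```

With these illustrative counts the posterior concentrates on the position where the level of the series jumps.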

(Menzefricke, 1981) 

Suppose X,, X,, ..., X, is a sequence of independent random variables such that 


N(6,,4) i1=1,2,....m 
'  |N(0,,) i=m+l....jn 
with unknown means 0, and precisions t j = 1, 2. The change point m(21) is also unknown. 
Obtain the marginal posterior distribution for m and the magnitude of the change w = r,/r, 
(i) Assuming that 8, = 0, = 8 (known) and the joint prior distribution of m, r,, r, is such that 
g(m, r,, r,) = g(m)g(r, g(r,) and g(r,) is Gamma(o, B,), j=1,2, and g(m) is discrete uniform 
over 1, 2, .n-l. 


(ii) When the joint prior g(m,6,,0,,1,,1,) = g(m)g(r, g(r, )g(8, | 1,)g(8, |t,) is such that g(t) 
is Gamma(o., B:) and g(6|r,) is N(u,,7;1,),J=12, and g(m) is a discrete uniform 
distribution. 


Let X,, X,, ..., X, X,,)5 +) X, be a sequence of independent random variables such that 


m? m+1? 
N(0,r,) if i=1,2,....m 
NO.) ifi=mt+lm+2,....n;1<m<n. 


Ifr,~ Gamma(o.,, 1) and r, ~ Gamma(,, 1), derive the marginal posterior distribution of the change 
point m. You may assume that r,, r, and m are a-priori independent of each other and the prior 


nl 
distribution of m is such that g(m) = p,, m= 1, 2, ..., n—1 with ay P,» =. 


m=1 
How will you detect whether there was actually a change? Discuss and give details. 
Consider a sequence X,, X,, ..., X, of independent gamma random variables with a changing 
scale parameter 9, i.e., 


x Jf 10)) i=12,...m 
'  |£(x|6,) i=m+1,m+2....,n, 


where 


a-l .-x/6 


£(x|0)= 10 > 0,0.>0,x>0, 
(04 


If m, 8,, 0, are a-priori independent such that 
p if m=n 
g(m) =} 1- 


so) 


ifl<ms<n-l]; 0<p<l, 


e 


n= 


and 


x00 = = Ta,)o'", if m=n 
6,0 


rl 
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Develop a Bayes test to detect the shift in the sequence. 
Consider a sequence X,, X,, .... X_,..., X, (n23) of independent normal random variables with 
changing mean 9, i.e., 
N(0,,1) i=1,2,...,.m 
'  |N(0,,1) i=m+4l......n. 


If m, 0,, 0, are a-priori independent such that 


p if m=n 
g(m)=41-p 


n- 


ifm#n; 0<p<l 


and the prior pdf of 0, is N(u, T,); 1= 1, 2. Develop a Bayes test to detect the shift in the mean. 
Consider a sequence X,, X,, ..., X_, X .., X, (n23) of the independent Poisson random 
variables with changing mean 9, i.e., 
Pois(®,) i=1,2,....m 
' |Pois(@,) i=m-+l.....,n. 


m+1? 


If m, 8, and 9, are a-priori independent of each other such that 


; if m=n 
g(m)=; 
if m#n 
2(n-1) 


and g(0.) is Gamma(1, B.), i= 1, 2. Develop a Bayesian test to detect the shift in the mean value 
of 0. 

Same question as Question (11.11) but precision is taken to be r (not 1). 

Suppose we have 10 coins out of which one coin is a fake coin having unknown probability
$\theta$ ($\neq 1/2$) of getting a head. Each of the 10 coins is flipped 100 times and 48, 53, 20, 45, 50, 55,
43, 49, 51, 52 are the observed numbers of heads. Use the structural change formulation to

(i) find the fake coin, and

(ii) obtain the estimate of $\theta$, assuming non-informative priors for the unknown parameters.
Consider a sequence of independent Poisson variates such that $X_i \sim \text{Pois}(\lambda_1)$, $i = 1, 2, \ldots, n$, $i \neq m$, and
$X_m \sim \text{Pois}(\lambda_2)$. Let us assume that m is independent of $\lambda_1$ and $\lambda_2$ and that its prior mass function
is $g(m) = 1/n$, $m = 1, 2, \ldots, n$. The prior for $\lambda_1$ is Gamma(1, 2) and that for $\lambda_2$ is Gamma(1, 10). Use the
structural change formulation to

(i) identify the outlier, and

(ii) obtain the Bayes estimates of $\lambda_1$ and $\lambda_2$.


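If $X \sim \text{Pois}(\theta)$ and $g(\theta \mid \lambda) = \lambda e^{-\lambda\theta}$; $\theta > 0$, $\lambda > 0$, show that the Empirical Bayes estimate of $\theta$,
based on the current independent observation x, may be expressed as

$$\hat{\theta}_{EB} = \frac{x + 1}{\hat{\lambda} + 1} = \frac{\bar{x}}{\bar{x} + 1}(x + 1),$$

where $\hat{\lambda}$ is the mle of $\lambda$ based on the historical data $x_1, x_2, \ldots, x_n$.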


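Suppose $X \sim N(\theta, \sigma^2)$, $\sigma^2$ known. Show that the Empirical Bayes estimate $\delta_g(x)$ of $\theta$, under SELF
and the prior distribution $g(\theta)$, is

$$\delta_g(x) = x + \sigma^2\, \frac{m_g'(x)}{m_g(x)},$$

where

$$m_g(x) = \int f(x \mid \theta)\, g(\theta)\, d\theta, \qquad m_g'(x) = \int \left[\frac{\partial}{\partial x} f(x \mid \theta)\right] g(\theta)\, d\theta.$$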


Let $x_1, x_2, \ldots, x_n$ be a realized sample of size n from a $N(\theta, 1)$ distribution and assume that the
prior distribution of $\theta$ is $N(\mu, \sigma^2)$. Show that the maximum likelihood estimates of the
hyperparameters $\mu$ and $\sigma^2$ are

$\hat{\mu} = \bar{x},$
and 


Consider the power series distribution

$$f(x \mid \theta) = \frac{a(x)\,\theta^{x}}{u(\theta)}, \quad x = 0, 1, 2, \ldots;\ \theta > 0.$$

Obtain the empirical Bayes estimate of $\theta$ under the squared error loss function with respect to the prior
pdf $g(\theta)$.

(Robert, 2001) 

Let X, ~ Pois(®), i= 1, 2, ...n. If the prior distribution of 8 is exponential with hyperparameter A. 
Show that 


x, |A~ Geol is } 1='1,2...3i, 
A+1 


The entropy loss associated with the m(x|A) is 


: x 1 es 
L(A,A =| x 1 l ’ 
avert 


and for g(A|d)=A~“"; d>0, show that the Bayes estimate of A is 
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Consider the problem of Hierarchical Bayes estimation discussed in Example 11.10. However, if 
hyperprior for B is improper uniform g,(B) 1, B20, show that the posterior distribution of 
vy =(1+B)" is of the form 


g(y|x)<y""(-y)". 
Hence show that the posterior means of 0,’s are shrunk by a factor (nx —1)/(nx + n) relative 


to the usual classical procedure which estimates each 0, by x,. What happens if nx <1? 


(Dey, 1999) 
Suppose X,, X,,..., X, is a random sample from N(0, 1). If g,(O|A) is N(O, (1-A)/nA), 0<A<1, and 
g,(A) is Beta(o., B), «>0, B>0. Show that the Hierarchical Bayes estimate of 6 under balanced loss 
function 
L(0,6) =" S\(x, -6)? + (1-@)(0-6),0< w<1, 
n 


i=l 


(alz)= (p+20-2) .F((p/2)+0+1(p/2)+a+B-+1;—nEx; /2) 
(A (p+20+2B-2) iF, ((p/2)+05(p/2)+0.+;—n3x; /2) 


and ,F(+*;*) is a confluent hypergeometric function. 

(Bansal and Singh, 1999) 

Suppose X,, X,,..., X, is a random sample from N(O, r) precision r. If the prior distribution of r 
belongs to the class of non-gamma K-priors defined as 


g(r) = g(r) + A383 (r)+ NBs (r), 
where 
g(r) is Gamma(a,b) distribution, 


7 va 3 

g,(r) = PEED es =a © Jas 
_ an,L,(r) 

B.(1) WA(a+i(a+2)\@4+3)er» 


d k 
L,() is a Laguerre polynomial of degree r defined by| a rgy(r)=L,(r)g(1) ; 


k=0,1,2...., 

and A, and A, are measures of departure of skewness and kurtosis of g(r), respectively. 

Find the posterior distribution of r and study its sensitivity to the misspecification of prior in 
terms of Kullback-Leibler divergence measure. 

(Note:- When 4,= 4,= 0, g(r) = g(r.) 
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(Bansal, 1978) 

Suppose X,, X,,..., X, is a random sample from N(@, 1). The prior for 0 is 
g(9) = g,(8)+ A383 (0)+ MBq (8) + A386 (9), 

where 
g,(0) is N(u,T), precision T, 


g,(0)= =H, (Ve(0-n))8), 
g,(9) = aH. (Vz (0-1))g, (8) > 


1 
(0) = He (Vt(8-H))g0(0), 


H, (+) is the Hermite polynomial of degree k, 


and A, and A, are measures of skewness and kurtosis, respectively. 

Find the posterior distribution of $\theta$. Use the Kullback-Leibler divergence measure to study the
sensitivity of the posterior distribution to non-normality in the prior.

(Note:- When 2,=A,= 0, g(0) is N(u, T) density.) 

(Bansal and Chakravarty, 1996) 

Consider changing linear regression model 


ie +u, t=1,2,..,m 


Y. i 
B,X,+u, t=m-+l,...,n, 


t 
n>3, where u,’s are iid N(0, r) random errors with known precision r, and m is a change point 
taking values 1, 2,..., n-1. Consider m to be uniformly distributed over the set (1, 2,..., n-1) and 
is independent of f, and 8.. If the joint prior distribution of B, and B, is such that the marginal 
prior pdf of B, is N(b,, ro,) and the conditional prior pdf of B,, given B,, is a member of the calss 
of Edgeworth series distributions 


102 1/2 
2(B, |B.) = [sin | o0| an 21- 2-p be 


fey 
b, =b, +B, -b,), 


where 


H=1+— HAH O)+ 5 SIH ()+= SHC), 


and H,(+) is the Hermite polynomial of degree k. 


(i) Derive the marginal posterior mass function of the change point and obtain the Bayes 
estimate of m under BLF. 

(i) | Obtain the posterior odds ratio in favour of the hypothesis that there is no change in 
the model. 

(ii) | Discuss robustness of the Bayes estimate of the change point with respect to the 
misspecification of the prior distribution of B., B,,. 


Glossary 


A-priori: A term introduced by Immanuel Kant to denote a proposition whose truth can be known 
independently of experience (Jaynes, 2003, pg. 87). In Latin, it means ‘from causes to
effects’, whereas, the term a-posteriori means ‘from effects to causes.’ 

A-posteriori: See a-priori. 

Action Space: It is the set of all possible actions, denoted by $\mathcal{A}$.

Actualization Principle: (Principle of inverse probability or Bayes theorem) It describes the updating 
of likelihood of an event A from P(A) to P(A|E) once an event E has been observed. 

Ad-hockery: It is the freedom to invent new estimators, confidence intervals, or hypothesis tests.
DeFinetti (1974) called this freedom ‘ad-hockery’. Classical inferential procedures are often
called ‘ad-hoc’ procedures.

Akaike’s Information Criterion (AIC): An index used as an aid to choose between competing models.
It is defined as $-2L_m + 2p$, where $L_m$ is the maximised log likelihood and $p$ is the number of
parameters in the model. The index takes into account both the statistical goodness of fit and
the number of parameters that have to be estimated to achieve the desired degree of fit, by
imposing a penalty for an increasing number of parameters. Lower values of the index indicate the
preferred model.

Assumption: It is a tentative statement on which an initial action can easily be based. It is not a
statement of unquestioned truth. In statistics, the term also refers to the conditions under which
a statistical technique gives valid results.

Axiom: A proposition in any scientific theory that is so constructed that it is taken as the starting 
point and does not have to be proved for that theory and from which the remaining 
propositions of the theory are deduced in accordance with certain rules. 

Basu, Debabrata: He was born on July 5, 1924 in Dhaka. He was the first research scholar under the 
supervision of C.R. Rao at the Indian Statistical Institute. He is well known for his essays on 
the foundations of statistical inference and theorems on complete sufficient statistics. He 
served Florida State University from 1975 to 1990. A collection of his critical essays ‘Statistical 
Information and Likelihood’ was edited by J.K. Ghosh and published by Springer-Verlag. 

Bayes Decision Rule: A decision rule $\delta^g$ which minimises the Bayes risk $r(g, \delta)$ of $\delta$ with respect to
the prior $g$ over all $\delta$ in the decision space $D$.

Bayes Definition of Probability: The probability of any event is the ratio between the value at which 
an expectation depending on the happening of the event ought to be computed, and the value 
of the thing expected upon its happening. 

Through this definition, Bayes sought to remove focus of the specification of an a-priori 
distribution from the forever unobserved $\theta$ and placed it on the ultimately observable $X$.

Bayes Estimate: A Bayes estimate of the parameter $\theta$, associated with the prior distribution $g(\theta)$ and
the loss function $L(\theta, \delta(x))$, is any estimate $\delta(x)$ which minimises the posterior expected loss.

Bayes Factor: A summary of evidence for the model $M_1$ against another model $M_2$ provided by an
observed sample $x$, which is used in model selection. It is given by the ratio of posterior to
prior odds,

$$B_{12} = \frac{P(x \mid M_1)}{P(x \mid M_2)}.$$

Twice the logarithm of $B_{12}$ is on the same scale as the likelihood ratio statistic.
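
As a hedged illustration of the ratio of marginal likelihoods defined above, the following sketch computes a Bayes factor for binomial data. The two models are assumptions made for this example only and do not come from the text: M1 fixes the success probability at 0.5, while M2 places a uniform prior on it, so that its marginal likelihood is 1/(n+1). The function names are hypothetical.

```python
from math import comb, log

def marginal_m1(x, n, theta0=0.5):
    """P(x | M1): binomial probability with theta fixed at theta0 (assumed model)."""
    return comb(n, x) * theta0**x * (1 - theta0)**(n - x)

def marginal_m2(x, n):
    """P(x | M2): binomial likelihood averaged over a uniform prior on theta.

    The integral of C(n, x) theta^x (1 - theta)^(n - x) over (0, 1) equals 1/(n + 1).
    """
    return 1.0 / (n + 1)

def bayes_factor_binomial(x, n, theta0=0.5):
    """B12 = P(x | M1) / P(x | M2)."""
    return marginal_m1(x, n, theta0) / marginal_m2(x, n)

# Hypothetical data: 7 successes in 10 trials.
b12 = bayes_factor_binomial(x=7, n=10)
two_log_b12 = 2 * log(b12)  # on the scale of the likelihood ratio statistic
```

A value of B12 above one favours M1 and a value below one favours M2; with these made-up data the evidence is only weakly in favour of M1.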

Bayes Information Criterion (BIC): An index used as an aid in choosing between competing models.
It is defined as $-2L_m + p \log n$, where $n$ is the sample size, $L_m$ is the maximum of the log
likelihood and $p$ is the number of parameters in the model. The index takes into account
both the statistical goodness of fit and the number of parameters that have to be estimated
to achieve this particular degree of fit by imposing a penalty for increasing the number of
parameters. Like AIC it also prefers a model which has a lesser number of parameters and still
provides an adequate fit to the data. If $n \ge 8$, this criterion will tend to favour models with
fewer parameters than those chosen by AIC. It is also known as Schwarz’s Criterion.
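
The two indices can be compared with a few lines of code. This is a minimal sketch under stated assumptions: the log likelihood values, parameter counts and sample size below are invented for illustration, and the function names `aic` and `bic` are hypothetical.

```python
from math import log

def aic(max_log_lik, p):
    """AIC = -2 * L_m + 2 * p."""
    return -2.0 * max_log_lik + 2.0 * p

def bic(max_log_lik, p, n):
    """BIC = -2 * L_m + p * log(n)."""
    return -2.0 * max_log_lik + p * log(n)

# Hypothetical fitted models: name -> (maximised log likelihood, number of parameters)
models = {"small": (-120.0, 2), "large": (-117.5, 5)}
n = 50  # assumed sample size

scores = {name: (aic(ll, p), bic(ll, p, n)) for name, (ll, p) in models.items()}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
```

The per-parameter penalty of BIC, log n, exceeds the AIC penalty of 2 exactly when n is at least 8, which is why BIC leans towards smaller models for all but very small samples.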

Bayes Postulate: Rev. Thomas Bayes’ assumption in his ‘Essay’ that an unknown probability $\theta$,
considered as a parameter, is a-priori uniformly distributed.

Bayes Principle: A decision rule $\delta_1$ is preferred to a rule $\delta_2$ if $E(R(\theta, \delta_1)) < E(R(\theta, \delta_2))$, where the
expectations are taken with respect to the prior distribution $g(\theta)$.

Bayes’ Problem: Given the number of times in which an unknown event has happened and failed: 
Required the chance that the probability of its happening in a single trial lies somewhere 
between any two degrees of probability that can be named. In modern notations, the solution 
to this problem, as given by Bayes, can be expressed as 


$$P(x_1 < x < x_2 \mid m \text{ happenings and } n \text{ failures of the unknown event}) = \frac{\int_{x_1}^{x_2} x^m (1-x)^n\, dx}{\int_0^1 x^m (1-x)^n\, dx}.$$

Note that Bayes’ problem was not about drawing a Bayesian inference concerning an arbitrary
parameter, but about a degree of probability.


Bayes Risk: The expected risk $r(g, \delta) = \int_\Theta R(\theta, \delta) g(\theta)\, d\theta$ is called the Bayes risk of $\delta$ with respect to
the prior $g$.

Bayes Risk of g: The Bayes risk of the Bayes decision rule $\delta^g$ is called the Bayes risk of $g$ and is denoted
by $r(g)$.

Bayes Rule:

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid A')P(A')}.$$

The conventional terminology for $P(A \mid B)$ is the posterior probability of $A$ given $B$ and that
for $P(A)$ is the prior probability of $A$, since it applies before (or not conditionally on) the
information that $B$ occurred.

The Bayes rule, expressed in terms of odds, is

$$\frac{P(A' \mid B)}{P(A \mid B)} = \frac{P(B \mid A')}{P(B \mid A)} \cdot \frac{P(A')}{P(A)}.$$

The ratio $P(A' \mid B)/P(A \mid B)$ is known as the posterior odds against $A$ and $P(A')/P(A)$ is known as the
prior odds against $A$. The other factor on the right, $P(B \mid A')/P(B \mid A)$, is the Bayes factor against $A$.

Bayes Theorem for Random Variables: Let $f(x, \theta)$ denote the joint probability density function (or pmf)
for a random observation vector $X$ and a random parameter vector $\theta$. Then

$$f(x, \theta) = f(x \mid \theta) g(\theta) = g(\theta \mid x) m(x)$$

gives

$$g(\theta \mid x) = \frac{f(x \mid \theta) g(\theta)}{m(x)} \propto f(x \mid \theta) g(\theta), \quad \text{with } m(x) \neq 0,$$

that is, the posterior pdf is proportional to prior pdf $\times$ likelihood function.

$g(\theta \mid x)$ is known as the posterior pdf for $\theta$, given the data $x$, $g(\theta)$ is the prior pdf for $\theta$, and
$f(x \mid \theta)$, viewed as a function of $\theta$, is the likelihood function. Note that the likelihood function
is often written as $\ell(\theta \mid x)$ to emphasize that it is not a pdf of $\theta$, whereas $f(x \mid \theta)$ is a pdf for
the observations given the parameters.
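
The event form of Bayes rule and its odds form can be checked against each other numerically. The sketch below is illustrative only: the prior probability and the two conditional probabilities are invented values (loosely in the spirit of a diagnostic test), and the function name `posterior_probability` is hypothetical.

```python
def posterior_probability(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A | B) from Bayes rule for an event A and its complement A'."""
    numerator = p_b_given_a * prior_a
    denominator = numerator + p_b_given_not_a * (1.0 - prior_a)
    return numerator / denominator

# Hypothetical inputs: P(A), P(B | A), P(B | A')
prior_a = 0.01
p_b_given_a = 0.95
p_b_given_not_a = 0.05

post_a = posterior_probability(prior_a, p_b_given_a, p_b_given_not_a)

# Odds form: posterior odds against A = Bayes factor against A * prior odds against A
prior_odds_against = (1.0 - prior_a) / prior_a
bayes_factor_against = p_b_given_not_a / p_b_given_a
posterior_odds_against = (1.0 - post_a) / post_a
```

Both routes give the same posterior odds, so either form may be used, depending on whether probabilities or odds are more convenient in the problem at hand.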

Bayes, Thomas: Rev. Thomas Bayes was a Presbyterian minister and pastor of a congregation at
Tunbridge Wells, Kent. He died at the age of 59 years on April 7, 1761. He was elected
Fellow of the Royal Society on April 8, 1742, on the basis of an article under the title of ‘Divine
Benevolence’ in reply to one on ‘Divine Rectitude’ by John Balguy. He took part in the
controversy about the metaphysical aspect of Sir Isaac Newton’s work on the doctrine of fluxions
against Bishop Berkeley. He is the author of two mathematical papers in the Philosophical
Transactions of the Royal Society which were communicated to the Royal Society after his
death by his friend Rev. Richard Price.

Bayes’ work is noteworthy in three respects:

(i) In his use of a continuous rather than a discrete framework.

(ii) In pioneering the idea of estimation through assessing the chances that an ‘informed
guess’ about the practical solution will be correct.

(iii) In proposing a formal description of what is meant by prior ignorance.

Bayesian Inference: Bayesian inference is the process of fitting a stochastic model to observed data
and summarizing the result by a probability distribution on the parameters of the model
and on unobserved quantities such as predictions for future unobserved data. It consists
of the following principal steps:

(i) Obtain the likelihood $\ell(\theta \mid x)$ as a function of $\theta$.

(ii) Obtain the prior distribution $g(\theta)$ for the unknown parameter $\theta$.

(iii) Apply Bayes theorem to derive the posterior distribution as $g(\theta \mid x) \propto \ell(\theta \mid x) g(\theta)$.

(iv) Derive appropriate statistical inferential statements from the posterior distribution such as
point estimates, interval estimates, or probabilities of hypotheses.

This approach to inference differs from that of orthodox (or frequentist) inference mainly
because it uses the prior information about the unknown parameter.
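
The four steps can be sketched on a grid of parameter values. This is only a rough numerical illustration, assuming binomial data (7 successes in 10 trials, invented for the example) and a uniform prior; the function name `grid_posterior` and the grid size are arbitrary choices, not part of the text.

```python
from math import comb

def grid_posterior(x, n, prior, grid):
    """Steps (i)-(iii): likelihood times prior, normalised over the grid."""
    likelihood = [comb(n, x) * t**x * (1 - t)**(n - x) for t in grid]  # step (i)
    prior_values = [prior(t) for t in grid]                            # step (ii)
    unnormalised = [l * p for l, p in zip(likelihood, prior_values)]   # step (iii)
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

# Midpoints of 1000 equal cells covering (0, 1)
grid = [(i + 0.5) / 1000 for i in range(1000)]
posterior = grid_posterior(x=7, n=10, prior=lambda t: 1.0, grid=grid)

# Step (iv): inferential summaries read off the posterior
posterior_mean = sum(t * w for t, w in zip(grid, posterior))
prob_theta_above_half = sum(w for t, w in zip(grid, posterior) if t > 0.5)
```

With a uniform prior the exact posterior here is a beta distribution with parameters 8 and 4, so the grid mean should be close to 8/12; the grid is used only because it makes each step visible.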

Bayesian Interval: An interval for the parameter that has a given posterior probability.

Bayesian Statistical Model: It is a model consisting of a parametric statistical model $f(x \mid \theta)$ and a prior
distribution $g(\theta)$.

Bernoulli, James (1654-1705): He is also known as Jacob and Jacques, Bernoulli. He was born in 
Basel, Switzerland. His book ‘Ars Conjectandi’, posthumously published in 1713, was an 
important contribution to probability theory, in which he posed the problem of inverse 
probability. 

Closed under Multiplication: If $f(\cdot \mid \alpha_1)$ and $f(\cdot \mid \alpha_2)$ are two pdfs belonging to a family $\mathcal{F}$ such that there
exists a pdf $f(\cdot \mid \alpha_3)$ in $\mathcal{F}$ such that $f(\cdot \mid \alpha_3) \propto f(\cdot \mid \alpha_1) f(\cdot \mid \alpha_2)$, then the family $\mathcal{F}$ of pdfs is said to
be closed under multiplication.

Closed under Sampling: A family $G$ of prior distributions for $\theta$ is said to be closed under sampling
from $f(x \mid \theta)$ if for every prior distribution $g(\theta)$ in $G$, the posterior distribution $g(\theta \mid x) \propto g(\theta) f(x \mid \theta)$
is also in $G$.

Conjugate Family: Suppose $\mathcal{P} = \{f(x \mid \theta);\ \theta \in \Theta\}$ is a family of distributions of the random variable $X$
which is indexed by a parameter $\theta$. Further suppose that the prior distribution of $\theta$ is a member
of some parametric family of distributions $G$, with the property, in relation to $\mathcal{P}$, that the
posterior distribution of $\theta$ is also a member of $G$. If this is so, we say that $G$ is a family of
conjugate prior distributions relative to $\mathcal{P}$. This property of the prior distribution is also known
as the closure property with respect to sampling from $\mathcal{P}$.
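
A standard instance of this closure property, not spelled out in the entry itself, is the beta family for binomial sampling: a beta(a, b) prior combined with s successes and f failures gives a beta(a + s, b + f) posterior. The sketch below assumes that textbook result; the function name and the numbers are hypothetical.

```python
def beta_binomial_update(a, b, successes, failures):
    """Posterior beta parameters after binomial data, starting from a beta(a, b) prior."""
    return a + successes, b + failures

# Hypothetical prior beta(2, 3) and data: 7 successes, 3 failures
a1, b1 = beta_binomial_update(2, 3, successes=7, failures=3)
posterior_mean = a1 / (a1 + b1)

# Closure under sampling also means that updating in two batches
# gives the same member of the family as updating once with all the data.
a_half, b_half = beta_binomial_update(2, 3, successes=4, failures=1)
a2, b2 = beta_binomial_update(a_half, b_half, successes=3, failures=2)
```

Because the posterior stays in the same family, only the two hyperparameters need to be tracked as data accumulate.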

Coverage Matching Prior: Suppose that $\theta$ is a scalar parameter and $\ell(x)$ and $u(x)$ satisfy
$P(\ell(x) < \theta < u(x) \mid x) = 1-\alpha$, so that $A_x = (\ell(x), u(x))$ is a set with posterior probability content
$1-\alpha$. In general, the frequentist coverage probability of $A_x$ will not be $1-\alpha$. However, in some
cases it is possible to have the same coverage and posterior probability. For example, if
$X_1, X_2, \ldots, X_n$ is a random sample from $N(\theta, 1)$ and $\theta$ is given a uniform prior then the interval
$A_x = (\bar{x} - n^{-1/2} z_{\alpha/2},\ \bar{x} + n^{-1/2} z_{\alpha/2})$ has posterior probability $1-\alpha$, and also has coverage $1-\alpha$,
where $P(Z > z_\alpha) = \alpha$ if $Z$ is $N(0, 1)$.

Criterion Robustness: It means that the sampling distribution of the criterion used to estimate 
parameters or test hypotheses about the parameters under the original model is not 
substantially affected by changing the model. It is often employed in non-Bayesian inference. 
According to Box and Tiao, criterion robustness is an inadequate investigation of robustness 
because as the model is changed the inference criterion should also change. 

Data Translated Likelihood: The notion of ‘data translated likelihood’ was introduced by Box and Tiao 
(1973) to motivate the use of uniform priors. 

The likelihood function is said to be data translated if it can be expressed in the form

$$\ell(\theta \mid x) = g(\phi(\theta) - t(x)),$$

for some real valued functions $g(\cdot)$ and $t(\cdot)$, with the definition of $g(\cdot)$ not depending on
$x$, and $\phi$ a one-to-one function of $\theta$. For example, suppose $x_1, x_2, \ldots, x_n$ is a random sample from the
$N(0, \sigma^2)$ distribution. Then

$$\ell(\sigma^2 \mid x) \propto (\sigma^2)^{-n/2} \exp\left\{-\frac{n s^2}{2\sigma^2}\right\}, \quad s^2 = \sum x_i^2 / n,$$

$$\propto (s^2)^{n/2} (\sigma^2)^{-n/2} \exp\left\{-\frac{n s^2}{2\sigma^2}\right\}$$

$$= \exp\left\{-\frac{n}{2}(\log\sigma^2 - \log s^2) - \frac{n}{2}\exp\big(-(\log\sigma^2 - \log s^2)\big)\right\}.$$

So for $y = \log\sigma^2 - \log s^2$, $t(x) = \log s^2$, $\phi(\sigma^2) = \log\sigma^2$, we have

$$g(y) = \exp\left\{-\frac{n}{2} y - \frac{n}{2}\exp(-y)\right\}$$

in the data translated form.

However, if $X \sim Bin(n, \theta)$ with $n$ known, one cannot exactly find $\phi(\theta)$ for which the likelihood
can be expressed in the data translated form. But using the fact that the transformed variable
$Z = \sin^{-1}\sqrt{X/n}$ has asymptotically a normal distribution with mean $\psi = \sin^{-1}\sqrt{\theta}$ and variance
$1/4n$, the transformation $\psi = \psi(\theta)$ will put the likelihood approximately in data translated form.
The difficulty with the idea of expressing the likelihood in data translated form is that there is no
way to check whether it is at all possible.

Decision Function: A real valued function $\delta$, defined on the sample space $\mathcal{X}$, which maps $\mathcal{X}$ into the action
space $\mathcal{A}$.

Decision Space: A decision space $D$ is a set of possible decision functions defined on $\mathcal{X}$.

Deduction: A method of inference or research. It denotes authentic proof (or inferring a conclusion) 
from one or several earlier premises on the basis of laws of logic. 

Default Prior: A prior that is chosen automatically without any contemplation of its suitability in a 
particular problem. 

DeFinetti, Bruno (1906-1985): He was born in Innsbruck, Austria. DeFinetti was a leading probability 
theorist for whom the sole interpretation of probability was a number describing the belief of 
a person in the truth of the proposition. He made a statement “probability does not exist” 
meaning that it has no reality outside an individual’s perception of the world. He was also a 
major contributor to subjective probability and Bayesian inference. 

Deviance Information Criterion (DIC): A measure of the extent to which a particular model differs from
the saturated model for the data set. It is defined as $D = -2(\ln L_c - \ln L_s)$, where $L_c$ and $L_s$ are
the likelihoods of the current model and the saturated model, respectively. Large values of $D$
indicate that the current model is a poor one.

Device of Imaginary Results: Let $f(x \mid \theta)$ be the pdf of the random variable $X$ and the prior distribution
of $\theta$ be $g(\theta \mid \alpha)$, $\theta \in \Theta$, with hyperparameter $\alpha$. Then the marginal density (or prior predictive
density) of $X$ is

$$m(x \mid \alpha) = \int_\Theta f(x \mid \theta) g(\theta \mid \alpha)\, d\theta, \quad x \in S.$$

The device of imaginary results implies that one should select $g(\theta \mid \alpha)$ in such a way that
it is compatible with one’s choice of $m(x \mid \alpha)$, $x \in S$. This can be achieved either by ‘fitting’ $m(x \mid \alpha)$
to imaginary predictions of future values of the observation or by fitting $m(x \mid \alpha)$ to past data
collected under the same circumstances as the future observations will be observed.

This device is intuitive since the experimenter is better able to think about an observable
random variable than about an unobservable parameter.

According to Stigler (1982), Rev. Thomas Bayes in his famous ‘Essay’ assumed a-priori
that

$$P(X = x) = 1/(N+1); \quad x = 0, 1, \ldots, N$$

to represent ignorance. In the binomial case, he further assumed

$$P(X = x \mid \theta) = \binom{N}{x} \theta^x (1-\theta)^{N-x}, \quad 0 < \theta < 1.$$

Therefore, for unknown $g(\theta)$, the integral equation

$$(N+1)^{-1} = \int_0^1 \binom{N}{x} \theta^x (1-\theta)^{N-x} g(\theta)\, d\theta$$

gives $g(\theta) = 1$, which is uniform on $[0, 1]$.

However, we know that $\theta \sim U(0, 1)$ implies $X$ is discrete uniform on $\{0, 1, \ldots, N\}$ but the
converse may not be true. Further, one must restrict the prior distribution to a parametric family
such as a conjugate class.
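
The forward direction of this argument, that a uniform prior on the success probability makes X discrete uniform, can be verified exactly. The sketch below uses the closed form of the beta integral for non-negative integer exponents; the helper names and the choice N = 6 are illustrative assumptions.

```python
from fractions import Fraction
from math import comb, factorial

def beta_integral(a, b):
    """Exact integral of theta^a * (1 - theta)^b over (0, 1) for non-negative integers a, b."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def prior_predictive_uniform(x, N):
    """m(x) for binomial sampling with a uniform prior on theta."""
    return comb(N, x) * beta_integral(x, N - x)

N = 6  # arbitrary number of trials for the check
probabilities = [prior_predictive_uniform(x, N) for x in range(N + 1)]
```

Every entry of `probabilities` equals 1/(N + 1), which is the discrete uniform distribution that Bayes took to represent ignorance about the observable.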

Direct Probability: Given the known random process, including values of its parameters, make 
probability statements about outcomes or data produced. In other words, direct probability 
makes probability statements about the ‘effects’ (outcomes) of a random experiment from a 
given ‘cause’ (random process). 

Duality between Loss and Prior: The Bayes decision rule with respect to a prior distribution $g(\theta)$ under
a loss function $L(\theta, \delta(x))$ is obtained by minimising $\int g(\theta) L(\theta, \delta(x)) f(x \mid \theta)\, d\theta$. Note that the
solution involves the product $g(\theta) L(\theta, \delta(x))$. The optimal decision remains the same when we
interpret the above problem as,

(i) prior pdf $g(\theta)$, loss function $L(\theta, \delta(x))$,

(ii) prior pdf $1$, loss function $g(\theta) L(\theta, \delta(x))$, or

(iii) prior pdf $h(\theta)$, loss function $g(\theta) L(\theta, \delta(x))/h(\theta)$.

This exhibits duality between loss function and prior distribution in the sense that it is
equivalent to obtain a Bayes decision rule for $\theta$ in any of the three situations.

Dutch Book: If an incoherent individual is willing to make a sequence of bets using incoherent
probabilities, and considers each bet fair or favourable, then the individual will suffer a net
loss no matter what happens. Such a sequence of bets is called a Dutch book. Note that an
individual is incoherent if his or her subjective conditional probabilities do not follow Rényi’s
axiom system.

Empirical Bayes Procedure: Empirical Bayes decision procedures are a class of decision theoretic
procedures that utilize past data as a means of bypassing the necessity of identifying a
completely unknown and unspecified prior distribution having a frequency interpretation. The
above definition may be expanded to include those cases in which the form of the prior distribution
is stated up to the values of the prior parameters, which are then estimated by means of past
data. The situation in which the form of the prior distribution is completely unspecified is called
non-parametric empirical Bayes.


Equivalent Prior Sample: Let $G = \{g(\theta \mid \alpha) : \alpha \in A\}$ be the family of conjugate prior distributions
relative to the family $\mathcal{P} = \{f(x \mid \theta) : \theta \in \Theta\}$ of sampling distributions. The observed sample data
transforms prior information, represented by the hyperparameter $\alpha_0 \in A$, to a new value $\alpha_1 \in A$
representing the posterior information about $\theta$. Writing $\alpha_1 = \alpha_0 + (\alpha_1 - \alpha_0)$, we have $(\alpha_1 - \alpha_0)$
as a measure of information in the sample data. We can, then, regard the prior information as
that provided by an ‘equivalent sample’ yielding $\alpha_0$.

For example, if $x = (x_1, x_2, \ldots, x_n)$ is the observed sample from $f(x \mid \theta) = \theta \exp(-\theta x)$ and the
conjugate prior $g(\theta \mid r, \lambda)$ is gamma$(r, \lambda)$, then the posterior distribution for $\theta$ is
gamma$(r+n, \lambda + n\bar{x})$. Thus we may regard the prior information as ‘equivalent to’ a prior sample of size $r$ from
the exponential distribution yielding a sample total $\lambda$. In other words, if $g(\theta)$ is the nil-prior
gamma$(0, 0)$, a sample of size $r$ with sample total $\lambda$ will yield gamma$(r, \lambda)$ as the posterior
distribution of $\theta$.
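
The exponential example can be traced with a small update function. This is a sketch under the gamma and exponential setup described above; the sample values are invented and the function name `gamma_exponential_update` is hypothetical.

```python
def gamma_exponential_update(r, lam, sample):
    """Posterior gamma(r + n, lam + sum of x) after an exponential sample of size n."""
    return r + len(sample), lam + sum(sample)

# A hypothetical 'equivalent prior sample' of size 3 with total 3.0
prior_sample = [0.5, 1.5, 1.0]
r, lam = gamma_exponential_update(0, 0.0, prior_sample)   # start from the nil-prior gamma(0, 0)

# New observed data, also hypothetical
new_sample = [2.0, 0.5]
posterior = gamma_exponential_update(r, lam, new_sample)

# The same posterior is reached by pooling the prior-equivalent sample with the data
pooled = gamma_exponential_update(0, 0.0, prior_sample + new_sample)
```

The agreement between `posterior` and `pooled` is what justifies reading the prior hyperparameters as a sample size and a sample total.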

Exchangeability: A set of observables $(X_1, X_2, \ldots, X_n)$ are exchangeable if their joint distribution function
is left unaltered by permuting their arguments. Mathematically, the set of observables
$(X_1, X_2, \ldots, X_n)$ are exchangeable if

$$P(X_1 \le x_1, \ldots, X_n \le x_n) = P(X_{i_1} \le x_1, \ldots, X_{i_n} \le x_n)$$

for all $n!$ permutations $(X_{i_1}, \ldots, X_{i_n})$ of $(X_1, X_2, \ldots, X_n)$.


Expected Utility Hypothesis: Let $R = \{r : 0 \le r < \infty\}$ denote the set of non-negative monetary rewards
and let $\mathcal{P}$ denote the class of probability distributions defined on subsets of $R$, such that if one chooses
$P \in \mathcal{P}$, he will receive a reward $x$, that is, a numerical realization of a random variable $X$ with
probability distribution $P$.

A real valued function $U$ defined on the set $R$ is said to be a utility function if it has the
following property:

Let $P_1, P_2 \in \mathcal{P}$ be any two distributions such that $E(U \mid P_1)$ and $E(U \mid P_2)$ exist; then $P_1$ is
not preferred to $P_2$ if, and only if, $E(U \mid P_1) \le E(U \mid P_2)$. For each reward $r \in R$, the number $U(r)$ is
called the utility of $r$ and $E(U \mid P)$, $P \in \mathcal{P}$, when it exists, is called the utility of $P$. The utility of
a probability distribution is thus equal to the expected utility of the reward that will be
received under that distribution. The hypothesis that there exists a utility function is called
the expected utility hypothesis.

Extensive Form of Analysis: A Bayes decision rule can be found by choosing, for each $x$, an action
which minimises the posterior expected loss or, equivalently, which minimises

$$\int_\Theta L(\theta, a) f(x \mid \theta) g(\theta)\, d\theta.$$

Raiffa and Schlaifer (1961) introduced this term.


Final Precision: The accuracy with which it is felt that the conclusion holds after the data has been 
observed. 
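
Fisher’s Information: Fisher (1925) defined the information provided by an experiment as

$$I(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2} \log f(x \mid \theta)\right] = E\left[\left(\frac{\partial}{\partial\theta} \log f(x \mid \theta)\right)^2\right],$$

the expectation being taken over possible values of $x$ for fixed $\theta$. The information depends on
$f(x \mid \theta)$.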
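
The equality of the two expectation forms can be checked directly for a simple case. The sketch below assumes a single Bernoulli observation, which is not an example given in the entry; for that model both forms reduce to 1/(θ(1-θ)). The function name is hypothetical.

```python
def fisher_information_bernoulli(theta):
    """Return both expectation forms of I(theta) for one Bernoulli(theta) observation."""
    outcomes = [(1, theta), (0, 1.0 - theta)]  # (value of x, its probability)

    def score(x):
        # First derivative of log f(x | theta) with respect to theta
        return x / theta - (1 - x) / (1.0 - theta)

    def second_derivative(x):
        # Second derivative of log f(x | theta) with respect to theta
        return -x / theta**2 - (1 - x) / (1.0 - theta)**2

    from_second_derivative = -sum(p * second_derivative(x) for x, p in outcomes)
    from_squared_score = sum(p * score(x)**2 for x, p in outcomes)
    return from_second_derivative, from_squared_score

info_a, info_b = fisher_information_bernoulli(0.3)
```

Both numbers agree with 1/(0.3 × 0.7), and the information grows without bound as the parameter approaches 0 or 1.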

Fractional Bayes Factor: Anthony O’Hagan (1995) defined the Fractional Bayes Factor as

$$B_{12}^{b}(x) = \frac{q_1(b, x)}{q_2(b, x)}, \quad \text{where} \quad q_i(b, x) = \frac{\int g_i(\theta_i) f_i(x \mid \theta_i)\, d\theta_i}{\int g_i(\theta_i) \big(f_i(x \mid \theta_i)\big)^b\, d\theta_i}, \quad i = 1, 2,$$

and $y$ is the chosen training sample of size $k$ from the data $x = (x_1, x_2, \ldots, x_n)$, with the fraction
$b = k/n$.

For improper priors $g_i(\theta_i) = c_i h_i(\theta_i)$, the indeterminate constants $c_1$ and $c_2$ cancel out, and we
have

$$q_i(b, x) = \frac{\int h_i(\theta_i) f_i(x \mid \theta_i)\, d\theta_i}{\int h_i(\theta_i) \big(f_i(x \mid \theta_i)\big)^b\, d\theta_i}, \quad i = 1, 2.$$

He suggested three ways to choose the fraction $b$:
(i) $b = k/n$, where robustness is of no concern,
(ii) $b = n^{-1} \max(k, n^{1/2})$, when robustness is of serious concern,
(iii) $b = n^{-1} \max(k, \log n)$ as an intermediate option.

Frequentist: A statistician who evaluates statistical procedures according to their long run performances
rather than focussing on the performance of the procedure for the obtained observation.

Frequentist Approach: (Frequentist inference or frequentist theory) An approach in which statistical
procedures are evaluated according to their long run performances, that is, on the average (or
in frequency).


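g-priors: It is Zellner’s informative class of reference priors for the parameters $\sigma^2$ and the $k \times 1$ vector of
regression parameters $\beta$ of the normal linear model

$$y = X\beta + u, \quad u \sim MVN(0, \sigma^2 I_n). \tag{1}$$

Let $y_0$ be an ‘imaginary’ sample generated by

$$y_0 = X\beta + u_0, \quad u_0 \sim MVN(0, \sigma_0^2 I_n) \tag{2}$$

and let $\sigma^2 = g\sigma_0^2$, $0 < g < \infty$. Consider the diffuse prior $p(\beta, \sigma_0) \propto 1/\sigma_0$; then the ‘posterior’
distribution, given the initial information $D_0$ in (2) and the diffuse prior, is

$$p(\beta, \sigma_0 \mid D_0) \propto \sigma_0^{-(n+1)} \exp\left[-\left\{\nu s_0^2 + (\beta - \hat\beta_0)' X'X (\beta - \hat\beta_0)\right\} / 2\sigma_0^2\right] \tag{3}$$

where $\hat\beta_0 = (X'X)^{-1} X' y_0$, $\nu s_0^2 = (y_0 - X\hat\beta_0)'(y_0 - X\hat\beta_0)$, $\nu = n - k$.

If $\beta_a$ and $\sigma_a^2$ are anticipated values of $\beta$ and $\sigma^2$, then Muth’s rational expectations hypothesis
allows one to take $\beta_a = E(\beta \mid D_0) = \hat\beta_0$ and $\sigma_a^2 = E(\sigma^2 \mid D_0) = \nu g s_0^2/(\nu - 2)$. The joint reference
informative prior (3) may be written as

$$p(\beta, \sigma \mid D_0, \sigma_a, \beta_a) \propto \sigma^{-(\nu+1)} \exp\left\{-\frac{\nu \bar\sigma_a^2}{2\sigma^2}\right\} \sigma^{-k} \exp\left\{-\frac{g}{2\sigma^2} (\beta - \beta_a)' X'X (\beta - \beta_a)\right\}, \tag{4}$$

where $\bar\sigma_a^2 = g s_0^2 = (\nu - 2)\sigma_a^2/\nu$, which is a normal-inverted gamma prior.

The marginal posterior distribution of $\beta$, based on (1), has mean $\bar\beta = (\hat\beta + g\beta_a)/(1+g)$ with
$\hat\beta = (X'X)^{-1} X' y$. Clearly

$$\bar\beta = \hat\beta \ \text{ if } g = 0, \qquad \bar\beta \approx \beta_a \ \text{ if } g \text{ is large.}$$

It should be noted that (i) when $g$’s value is unknown, a prior for $g$ can be introduced and $g$
can be integrated out, (ii) one can write (2) as $y_0 = X_0\beta + u_0$ with $X_0$ of dimension $n_0 \times k$, and
$y_0$ and $u_0$ each an $n_0 \times 1$ vector, and then proceed to construct the g-prior as above, and (iii) one
can take $g = g(n)$, a function of $n$, and a proper choice of this function can allow prior
precision to grow with $n$ and, if desired, at a rate less than the rate at which the sample
precision grows.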

Generalised Bayes Rule: Let $g(\theta)$ be an improper prior distribution in a decision problem with loss
$L(\theta, a)$. An action $a \in \mathcal{A}$, for a given $x$, is a generalized Bayes rule if it minimizes
$\int_\Theta L(\theta, a) f(x \mid \theta) g(\theta)\, d\theta$, or, if $0 < m(x) < \infty$, if it minimizes $\int_\Theta L(\theta, a) g(\theta \mid x)\, d\theta$.


Generalised Maximum Likelihood Estimate: It is the largest mode of the posterior distribution of the 
parameter. 

Global Robustness: It is robustness of Bayes rule with respect to the prior distribution. A class G of 
priors is considered and Bayes inference and/or decision is obtained with respect to members 
of G. If the Bayes inference and/or decision does not change significantly as the prior varies 
over G, we say that the Bayes rule is globally robust. 

Some of the well-known classes of priors are $\varepsilon$-contamination neighbourhoods and the distribution
band class. Sometimes the class is constructed on the basis of ‘shape’ requirements such as
symmetry and unimodality of the distributions.

Hartigan’s ALI Prior: Hartigan (1964) defined a prior $g(\theta)$ to be relatively invariant if
$g(z(\theta))\,dz(\theta)/d\theta = c\,g(\theta)$ for some $c$, whenever $z$ is a one-to-one differentiable transformation
satisfying $f(z(x) \mid z(\theta))\,dz(x)/dx = f(x \mid \theta)$ for all $x$ and $\theta$. An asymptotic version leads to an
asymptotically locally invariant (ALI) prior defined in the one-dimensional case by

$$\left.\frac{\partial}{\partial\theta} \log g(\theta)\right|_{\theta=\hat\theta} = -\left.\frac{E(f_1 f_2)}{E(f_2)}\right|_{\theta=\hat\theta}, \quad \text{where } f_i = \frac{\partial^i}{\partial\theta^i} \log f(x \mid \theta),$$

and $\hat\theta$ is the maximum likelihood estimate of $\theta$. In particular, for the binomial example the ALI prior for $\theta$ is
$g(\theta) \propto 1/\theta(1-\theta)$, which is also Haldane’s nil-prior.

Hierarchical Bayes Model: A hierarchical Bayes model is $(f(x \mid \theta), g(\theta))$ where the prior distribution $g(\theta)$
is decomposed in conditional distributions $g_1(\theta \mid \theta_1), g_2(\theta_1 \mid \theta_2), \ldots, g_k(\theta_{k-1} \mid \theta_k)$ and a marginal
distribution $g_{k+1}(\theta_k)$ such that

$$g(\theta) = \int g_1(\theta \mid \theta_1) g_2(\theta_1 \mid \theta_2) \cdots g_k(\theta_{k-1} \mid \theta_k) g_{k+1}(\theta_k)\, d\theta_1 \cdots d\theta_k.$$

The parameter $\theta_i$ is called the hyperparameter of level $i$ $(1 \le i \le k)$. The conditional prior distribution
$g_i(\theta_{i-1} \mid \theta_i)$ is called the hyperprior of $\theta_{i-1}$.

Highest Posterior Density Credibility Interval (Region): Let $F(\theta)$ denote the posterior cdf of $\theta$ given
$x$. We seek an interval $(a, b)$ such that
(i) $F(b) - F(a) = \alpha$.
(ii) If $(a, b)$ is an HPD interval, then for any $\theta_1 \in (a, b)$ and any $\theta_2 \notin (a, b)$,
$P(\theta_1 \mid x) \ge P(\theta_2 \mid x)$, and conversely, subject to condition (i).
In particular, for unimodal symmetric posterior pdfs, the interval (or region) with given
probability content $\alpha$, which is centered at the modal value of the posterior pdf, is the
Bayesian HPD interval (or region).
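
Conditions (i) and (ii) suggest a simple numerical recipe on a grid: keep adding the points of highest posterior density until the required probability content is reached. The sketch below is one such approximation, assuming a unimodal density evaluated on an equally spaced grid; the function name `hpd_interval`, the grid, and the standard normal stand-in for a posterior are all illustrative choices.

```python
from math import exp

def hpd_interval(grid, density, content):
    """Approximate HPD interval from density values on an equally spaced grid.

    Points are taken in order of decreasing density until their share of the
    total mass reaches `content`; for a unimodal density they form an interval.
    """
    total = sum(density)
    order = sorted(range(len(grid)), key=lambda i: density[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += density[i] / total
        if mass >= content:
            break
    return grid[min(chosen)], grid[max(chosen)]

# Stand-in posterior: an unnormalised standard normal density on [-4, 4]
grid = [-4.0 + 8.0 * i / 4000 for i in range(4001)]
density = [exp(-0.5 * t * t) for t in grid]
a, b = hpd_interval(grid, density, content=0.95)
```

For this symmetric unimodal example the interval is centred at the mode, in line with the closing remark of the entry, and its end points are close to the familiar values of about -1.96 and 1.96.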

Highest Predictive Density Interval (Region): It is a region (interval) $R$ such that
$P(y \in R \mid x) = \int_R f(y \mid x)\, dy$, where $R$ is a subspace of the range of $y$. The highest predictive density
region with the given probability content is such that the predictive pdf’s values over the
region are not less than those relating to any other region with the same probability content.

Huzurbazar, Vasant Shanker (1919-1991): He was awarded a doctorate in mathematics in 1950 under
the supervision of Sir Harold Jeffreys and also the Adams Prize of the University of
Cambridge. His work on sufficiency and related topics is published in a monograph on
sufficient statistics, and results from his Ph.D. thesis were included in Jeffreys’ book ‘Theory
of Probability’. He served the University of Pune from 1953 to 1976 and the University of Denver,
Colorado from 1979 to 1991. He was awarded the Padma Bhushan by the Government of India.

Hyper-Parameter: It is the parameter of the informative prior distribution. This nomenclature is used 
to distinguish between indexing constants of sampling and prior distributions. 

Hyper-prior: It is the prior distribution of the hyper-parameter of the prior. 

Imaginary Training Sample: Smith and Spiegelhalter (1982) suggested a device of Imaginary Training 
Sample to overcome the difficulty arising from the use of improper prior distribution in 
computing Bayes factor. It is an additional data set which involves the smallest possible 
sample size permitting a comparison of models M, and M.,. Further it should provide maximum 
possible support for M.. 

Imprecise Probability: It is a generic term used to describe mathematical models that measure 
uncertainty without precise probabilities. There are many imprecise probability theories like 
upper and lower probabilities, belief functions, Choquet capacities, fuzzy logic, and upper and 
lower previsions. 

Improper Prior: An improper prior is any weight function that sums (or integrates) over the possible 
values of the parameter to a value other than one, say c. If c is finite, then an improper prior 
can induce a proper prior by normalising the function. However, when c is infinite (or otherwise 
does not exist) over the range of possible values of parameter, an improper prior remains 
improper and plays the role of a weighting function. A prior distribution $g(\theta)$ is said to be
improper (or generalised) if $\int_\Theta g(\theta)\, d\theta = +\infty$.


Indifference Prior: It was introduced by Novick and Hall in 1965. It is a limiting case of the conjugate class
of priors such that (i) the prior is improper, (ii) a minimum necessary sample induces a proper
posterior. For example, in a binomial problem, $g(\theta) = (\theta(1-\theta))^{-1}$ is an indifference prior since
it is improper and a single success and a single failure induce a proper posterior.

Induction: It is the experimental testing of a finished theory. It does not originate any idea. 

Inductive Inference: Making inferences from past experience to predict future experience. 

Inference Robustness: It means that inferences made about parameters on the basis of data do not
change substantially with a change in the model. It is the natural way to examine robustness
in the Bayesian framework since sample as well as prior information are taken into account in a
Bayesian formulation.

Informal Robustness: In the informal approach to Bayesian robustness, one tries a few different priors
(and/or sampling models) to see if the answers vary substantially; if they do not, one is reasonably
happy that there is robustness. This approach is widely used by practitioners.

Informative Prior: A prior distribution that reflects empirical or theoretical information regarding the
value of the unknown parameter.

Initial Precision: The accuracy with which statistical decision rules perform in a long series of
identical experiments is known as initial precision (a term first used by Savage, 1962). Classical
statisticians use initial precision, whereas Bayesians employ final precision for evaluating
the performance of decision rules.

Intrinsic Bayes Factor: Let $y$ be the part of the data $x = (x_1, x_2, \ldots, x_n)$ containing the smallest number
$k$ of observations such that the posterior distributions $g_j(\theta_j \mid y)$ are properly defined and,
therefore, $B_{12}(y)$ exists. Berger and Pericchi (1993) called $y$ a minimal experiment. If the
minimal experiment contains $k$ $(< n)$ observations then there are $m = \binom{n}{k}$ minimal experiments.
Let $y_{(i)}$ denote the observations constituting the $i$th minimal experiment and $z_{(i)}$ denote the
corresponding $(n-k)$ remaining observations from $x$.

Berger and Pericchi (1993) defined the Partial Bayes Factor of model $M_1$ to model $M_2$, conditional
on $y_{(i)}$, as

$$B_{12}(z_{(i)} \mid y_{(i)}) = \frac{\int f_1(z_{(i)} \mid \theta_1, y_{(i)})\, g_1(\theta_1 \mid y_{(i)})\, d\theta_1}{\int f_2(z_{(i)} \mid \theta_2, y_{(i)})\, g_2(\theta_2 \mid y_{(i)})\, d\theta_2} = \frac{f_1(x)}{f_2(x)} B_{21}(y_{(i)}),$$

where $B_{21}(y_{(i)})$ is the Bayes factor based on the $i$th minimal experiment $y_{(i)}$, and $f_j(x)$ is the
marginal density under model $M_j$ $(j = 1, 2)$. Since the partial Bayes factor depends on the choice
of the minimal training sample $y_{(i)}$, they suggested averaging the partial Bayes factors over all
minimal training samples.

They defined the Arithmetic Intrinsic Bayes Factor as

$$B_{12}^{AI} = \frac{1}{m} \sum_{i=1}^{m} B_{12}(z_{(i)} \mid y_{(i)}) = B_{12}(x)\, \frac{1}{m} \sum_{i=1}^{m} B_{21}(y_{(i)})$$

and the Geometric Intrinsic Bayes Factor as

$$B_{12}^{GI} = \left(\prod_{i=1}^{m} B_{12}(z_{(i)} \mid y_{(i)})\right)^{1/m} = B_{12}(x) \left(\prod_{i=1}^{m} B_{21}(y_{(i)})\right)^{1/m}.$$


Inverse Probability: We have a given data and from the information in the data, we try to infer what 
random process generated them. In other words, from known effects, we infer the cause. 
Problems of statistical inference are problems in inverse probability, whereas, many gambling 
problems are problems in direct probability. James Bernoulli (1713), in his book ‘Ars 
Conjectandi’ posed the problem of inverse probability. 

Jeffreys, Harold (1891-1989): Sir Harold Jeffreys was born on April 22, 1891 in England and died on
March 18, 1989 in Cambridge at the age of 97. He was an accomplished mathematician. His
contributions have been of fundamental importance to the philosophy of science, scientific
method, and theoretical and applied statistics. His major works in this area are ‘Theory of
Probability’ and ‘Scientific Inference’. E.T. Jaynes regards Jeffreys’ works as a direct
continuation of Laplace’s research on Bayesian theory, applications and philosophy. In
addition to producing an inductive statistical calculus for science he made many other
outstanding contributions to statistical science.

Jeffreys’ Prior: A non-informative prior distribution proportional to the square root of Fisher 
information. These priors are left invariant. 

Jeffreys’ Thumb Rule: For a scalar parameter $\theta$, the non-informative prior is
Case (i): $g(\theta) \propto$ constant, if $\theta \in (a, b)$ or $\theta \in (-\infty, \infty)$,
Case (ii): $g(\theta) \propto 1/\theta$, if $\theta \in (0, \infty)$.

Kullback-Leibler Distance (or Entropy Distance): It is a method for evaluating the distance between 
two distributions f and g and is given by 

$$I(f, g) = \int f(x)\log\frac{f(x)}{g(x)}\,dx.$$

It is not a true Cartesian distance and the scale is arbitrary. It is not symmetric, that is, in general 

$$I(f, g) \neq I(g, f).$$
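
The asymmetry can be checked numerically. The following is a minimal sketch for discrete distributions, where the integral becomes a sum; the function name `kl_divergence` and the two example distributions are illustrative assumptions, not taken from the text.

```python
import math


def kl_divergence(f, g):
    """Return I(f, g) = sum_x f(x) * log(f(x) / g(x)) for two discrete
    distributions given as lists of probabilities on the same support."""
    if len(f) != len(g):
        raise ValueError("f and g must be defined on the same support")
    total = 0.0
    for fx, gx in zip(f, g):
        if fx == 0.0:
            continue  # the term 0 * log(0 / g) is taken to be 0
        if gx == 0.0:
            return math.inf  # f puts mass where g has none
        total += fx * math.log(fx / gx)
    return total


# Hypothetical example distributions on a two-point support.
f = [0.5, 0.5]
g = [0.9, 0.1]
i_fg = kl_divergence(f, g)
i_gf = kl_divergence(g, f)
print(i_fg, i_gf)  # the two values differ, so I(f, g) != I(g, f)
```

Swapping the arguments changes the value, which is why the quantity is a directed divergence and not a distance in the usual metric sense.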

Laplace, Pierre Simon de (1749-1827): He was a French mathematician and astronomer. He wrote a 
number of memoirs between 1774 and 1820 on various aspects of inverse probability. 
Laplace established the posterior median as an optimal estimator which minimizes average 
absolute error and extended it by proving a similar result for squared error. Two well-known 
results proposed by him are the ‘principle of insufficient reason’ and the ‘rule of succession’. 

Bayes’ and Laplace’s works have been responsible for suggesting that the parameters and 
observations are fundamentally identical objects. This may be achieved by constructing a 
probability distribution on the parameter space. 

Laplace’s Rule of Succession (1774): If we have n independent Bernoulli trials with the probability 
of success $\theta$ on each trial and we use a uniform probability distribution for the parameter $\theta$ then, 
given r successes in n trials, the probability of success in the (n+1)th trial is (r+1)/(n+2). This 
result is known as Laplace’s rule of succession. It applies to problems which can be 
reasonably idealized to one with only two hypotheses, a belief in a constant “causal 
mechanism”, and no other prior information. Furthermore, the rule of succession may not give 
reasonable answers when the number n of trials is very small. 
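
A minimal sketch of the rule follows. It returns (r+1)/(n+2) exactly and, as a cross-check, approximates the posterior mean of $\theta$ under a uniform prior by midpoint quadrature. The function names and the quadrature step count are illustrative assumptions.

```python
from fractions import Fraction


def rule_of_succession(r, n):
    """Probability of success on trial n + 1 given r successes in n trials,
    under a uniform prior on the success probability."""
    if not 0 <= r <= n:
        raise ValueError("need 0 <= r <= n")
    return Fraction(r + 1, n + 2)


def posterior_mean_by_quadrature(r, n, steps=20000):
    """Midpoint-rule approximation of E(theta | r, n) under a uniform prior,
    i.e. the ratio of the integrals of theta * L(theta) and L(theta),
    with L(theta) = theta^r * (1 - theta)^(n - r)."""
    h = 1.0 / steps
    num = 0.0
    den = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        weight = t ** r * (1.0 - t) ** (n - r)
        num += t * weight
        den += weight
    return num / den


exact = rule_of_succession(7, 10)
approx = posterior_mean_by_quadrature(7, 10)
print(exact, approx)
```

With no data at all (r = 0, n = 0) the rule gives 1/2, which illustrates why the answer should be treated with caution when n is very small.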

Law of Total Probability: 

$P(A) = P(A \mid B)P(B) + P(A \mid B')P(B')$. 
In general, if $B_1, B_2, \ldots$ subdivide the sample space then 
$P(A) = P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2) + \ldots$ 
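
A small numerical sketch of the law is given below. The function name `total_probability` and the example numbers (a two-event partition with P(B) = 0.01) are hypothetical and serve only as an illustration.

```python
def total_probability(conditionals, partition_probs):
    """Return P(A) = sum_i P(A | B_i) * P(B_i) for a partition B_1, B_2, ..."""
    if len(conditionals) != len(partition_probs):
        raise ValueError("one conditional probability is needed per partition event")
    if abs(sum(partition_probs) - 1.0) > 1e-9:
        raise ValueError("partition probabilities must sum to one")
    return sum(c * p for c, p in zip(conditionals, partition_probs))


# Hypothetical numbers: P(B) = 0.01, P(A | B) = 0.95, P(A | B') = 0.10.
p_b = 0.01
p_a = total_probability([0.95, 0.10], [p_b, 1.0 - p_b])
print(p_a)
```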

Least Favourable Prior Distribution: A least favourable prior distribution is one that makes the 
minimum Bayes risk as large as possible. 

Likelihood: The likelihood of a model is the probability of the observations assuming that model. 

Likelihood Function: It is a sample density written in the proper order as a function of the unknown 
parameter $\theta$, depending on the observed value x. Thus $\ell(\theta \mid x) = f(x \mid \theta)$. See Bayes theorem. 

Likelihood Principle: The information brought by an observation x about $\theta$ is entirely contained in 
the likelihood function $\ell(\theta \mid x)$. In the statistical inference framework, if $x_1$ and $x_2$ are two 
observations depending on the same parameter $\theta$, such that there exists a constant c satisfying 
$\ell_1(\theta \mid x_1) = c\,\ell_2(\theta \mid x_2)$ for every $\theta$, then they bring the same information about $\theta$ and must lead 
to identical inferences. 
The likelihood principle is automatically satisfied in Bayesian inference since the posterior 
distribution depends on x only through $\ell(\theta \mid x)$. Note that information is used in the general 
sense of the collection of all possible inferences on $\theta$, not in the mathematical sense of Fisher’s 
information. Some authors call it a discrete likelihood principle. 

Likelihood (Model) Robustness: A class of models is considered, and it is examined whether the inference 
and/or decision changes significantly across the class. 

Lindley’s Paradox: (Lindley 1957) A paradoxical result that a non-Bayesian could strongly reject a sharp 
null hypothesis H that $\theta = \theta_0$, while a Bayesian could put a non-zero prior probability p on H 
and then spread the remaining (1−p) probability over all other values in a vague way and find 
high posterior odds in favour of H. 
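
A numerical sketch of the paradox follows. The setting is an assumption made for illustration and is not part of the entry: the sample mean of n normal observations with known variance $\sigma^2$, a sharp null value of 0, a $N(0, \tau^2)$ prior for $\theta$ under the alternative, and prior probability p = 1/2 on H. With $z$ the usual test statistic and $\rho = n\tau^2/\sigma^2$, the Bayes factor in favour of H is $\sqrt{1+\rho}\,\exp\{-\tfrac{1}{2}z^2\rho/(1+\rho)\}$. All function names are hypothetical.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value of a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def bayes_factor_null(z, n, tau2_over_sigma2=1.0):
    """Bayes factor in favour of the sharp null, assuming a N(0, tau^2)
    prior under the alternative and known sampling variance sigma^2."""
    rho = n * tau2_over_sigma2
    return math.sqrt(1.0 + rho) * math.exp(-0.5 * z * z * rho / (1.0 + rho))


def posterior_prob_null(z, n, prior_prob=0.5, tau2_over_sigma2=1.0):
    """Posterior probability of the null given z, n and the prior mass on H."""
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = prior_odds * bayes_factor_null(z, n, tau2_over_sigma2)
    return posterior_odds / (1.0 + posterior_odds)


z = 2.5  # "significant" at the 5% level for every sample size
for n in (10, 1000, 10 ** 6):
    print(n, two_sided_p_value(z), posterior_prob_null(z, n))
```

For a fixed value of z the p-value does not change with n, while the posterior probability of H grows with n, which is the conflict the paradox describes.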

Local Robustness: Suppose we are interested in the sensitivity of the posterior quantity $\psi$, based on 
the prior distribution $g \in G$ and the sampling distribution $f \in \mathcal{F}$, to small deviations from some 
specified choices $g_0$ and $f_0$, respectively. In the local robustness approach, one uses suitable 
derivatives of $\psi$ with respect to g and f evaluated at the selected $g_0$ and $f_0$, respectively, as 
measures of sensitivity to small deviations from $g_0$ and $f_0$. 

This approach is computationally easy when sensitivity due to uncertainty in a common 
distribution is assessed and the model involves a random sample of quantities from this 
common distribution. 

Locally Uniform Priors: A locally uniform prior is a normalised or non-normalised prior that is uniform 
over a bounded region of the support of the unknown parameter. It does not change very much over the 
region in which the likelihood is appreciable and does not take very large values outside this region. 

Loss Function: It is a real valued function defined for all $(\theta, a) \in \Theta \times \mathcal{A}$ which represents the loss suffered 
if the true parameter value is $\theta$ and an action a is taken as an estimate of $\theta$. 

Loss Robustness: A group of decision makers may have different ideas about the features of an optimal 
decision rule resulting in different loss functions. In loss robustness studies, one could be 
interested in quantifying changes in the posterior expected loss or in the optimal action. 

Marginalisation Paradox: Suppose that we have a model $f(x \mid \theta, \phi)$ and a prior $g(\theta, \phi)$ and that the marginal 
posterior $g(\theta \mid x)$ satisfies $g(\theta \mid x) = g(\theta \mid z(x))$ for some function z(x). Further suppose that $f(z \mid \theta, \phi)$ 
$= f(z \mid \theta)$. It seems that we should be able to recover $g(\theta \mid x)$ from $f(z \mid \theta)$ and some prior $g(\theta)$. 
Indeed, if $g(\theta, \phi)$ is proper then this is true. However, in some situations, if we use improper 
priors we obtain $f(z \mid \theta, \phi) = f(z \mid \theta)$, but $f(z \mid \theta)\,g(\theta)$ is not proportional to $g(\theta \mid z(x))$ for any $g(\theta)$, in 
violation of the desirable recoverability condition. Dawid, Stone and Zidek (1973) called this 
a marginalisation paradox. E.T. Jaynes (2003) has discussed it in detail. 


Maximal Data Information Priors: Let $I(\theta) = \int f(x \mid \theta)\log f(x \mid \theta)\,dx$ be the information about X in 
the sampling density. Zellner suggested choosing the prior $g(\theta)$ that maximises the difference 
$G = \int I(\theta)\,g(\theta)\,d\theta - \int g(\theta)\log g(\theta)\,d\theta$. 
The optimal solution is $g(\theta) \propto \exp(I(\theta))$. In particular for the Bin(n, $\theta$) model, $g(\theta) \propto \theta^{\theta}(1-\theta)^{1-\theta}$. 


Maximum Entropy Principle: Out of all the distributions consistent with the constraints $E(g_k(x)) = a_k$, 
$k = 1, 2, \ldots$, and $\sum_{i=1}^{n} p_i = 1$, choose the distribution $P = \{p_1, p_2, \ldots, p_n\}$ that has maximum 
uncertainty (entropy), where $g_k(x)$ are functions of the random variable X. 

Model: A description of the assumed structure of the set of observations that can range from a fairly 
imprecise verbal account to, more usually, a formalised mathematical expression of the process 
assumed to have generated the observed data. 

Monte Carlo Methods: Methods for finding solutions to mathematical and statistical problems by 
simulation. 

Moral Expectation: It is the value or pleasure to us of an event. It may not have any relation to the 
monetary value in terms of mathematical expectation. The term was used by Laplace. 

Multiplication Rule: $P(A \cap B) = P(A)\,P(B \mid A)$ 

Natural Conjugate Prior: A pdf (or pmf) $g(\theta)$ of the parameter $\theta$ is called the natural conjugate to 
the likelihood function if $g(\theta)$ and the likelihood have the same functional form (kernel) as functions of $\theta$. 
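
A minimal sketch of conjugacy for the binomial model follows. It is an illustration with hypothetical function names and example numbers: a Beta(a, b) prior combined with r successes in n trials gives a Beta(a + r, b + n − r) posterior, and the ratio of prior times likelihood to the posterior density is the same at every $\theta$.

```python
import math


def beta_pdf(theta, a, b):
    """Density of the Beta(a, b) distribution at theta in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(theta)
                    + (b - 1.0) * math.log(1.0 - theta))


def binomial_likelihood(theta, r, n):
    """Probability of r successes in n Bernoulli(theta) trials."""
    return math.comb(n, r) * theta ** r * (1.0 - theta) ** (n - r)


def beta_binomial_update(a, b, r, n):
    """Posterior Beta parameters after observing r successes in n trials."""
    return a + r, b + n - r


# Hypothetical prior Beta(2, 3) and data: 7 successes in 10 trials.
a_post, b_post = beta_binomial_update(2.0, 3.0, 7, 10)

# prior * likelihood / posterior should not depend on theta.
ratios = [
    beta_pdf(t, 2.0, 3.0) * binomial_likelihood(t, 7, 10) / beta_pdf(t, a_post, b_post)
    for t in (0.2, 0.5, 0.8)
]
print(a_post, b_post, ratios)
```

The constant ratio is the marginal probability of the data, so the posterior stays in the same family as the prior.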

Necessarist Probabilist: Logical probabilists argue that a probability is a degree of belief. They 
believe in representing different degrees in the relationships between one proposition and 
another, or between a proposition and a body of evidence, following the principles of mathematical 
logic. Examples are H. Jeffreys, Keynes and Carnap. 

Non-informative Prior: A prior in which little new explanatory power about the unknown parameter 
is provided by intention. It is a formal way of expressing ignorance of the value of the 
parameter over the permitted range. It also represents an unprejudiced state of the investigator’s 
mind. 

Normal Form of Analysis: It consists of choosing a decision rule $\delta$ to minimise the Bayes risk 
$r(g, \delta)$ of the decision rule with respect to the prior distribution g. The term was introduced by Raiffa 
and Schlaifer (1961). 

Nuisance Parameter: It is a parameter that is included in the probability model for the experiment at 
hand because it is necessary for the good fit of the model but that is not of primary interest 
to the investigator. 

Occam’s Razor: An early statement of the parsimony principle, given by William of Occam (1280-1349), 
namely ‘entia non sunt multiplicanda praeter necessitatem’, that is, entities should not be 
multiplied beyond necessity. Also known as the ‘simplicity postulate’. See also principle of 
parsimony. 

Odds in Favour of Event A: Suppose the probability of an event A is P(A). The odds in favour of event 
A are O(A) = P(A)/(1−P(A)). 

p-value: The p-value associated with a statistical test is the smallest significance level $\alpha$ for which 
the null hypothesis is rejected. 

Paradigm: It is a grammatical term but may be used for model or principles. 

Paradox: It is a formal logical contradiction. It arises when two contradictory propositions are equally 
demonstrable. 

Parameter Space: The set of all admissible values of the parameter $\theta$ of a pdf (or pmf) is called the 
parameter space and is denoted by $\Theta$. 

Parametric Statistical Inference: It is concerned with posing and testing a parametric model when a 
data value has some probability of occurrence given some specific value of the parameter. 

Parametric Statistical Model: It consists of an observation of a random variable X distributed according 
to $f(x \mid \theta)$, where only the parameter $\theta$ is unknown and belongs to a parameter space $\Theta$ of finite 
dimension. 

Parsimony: Unusual or excessive frugality. 

Partial Bayes Factor: The data x is divided in two parts y and z where y is used as an imaginary 
training sample to provide information about the parameter $\theta_1$ or $\theta_2$ of model $M_1$ or $M_2$, 
respectively, and the remaining data z is used for model comparison. 

Lempers (1971) defined the Partial Bayes Factor as the ratio 

$$B_{12}(z \mid y) = q_1(z \mid y)/q_2(z \mid y),$$

where 

$$q_i(z \mid y) = \int f_i(z \mid \theta_i, y)\,g_i(\theta_i \mid y)\,d\theta_i, \quad i = 1, 2.$$

The use of a training sample allows one to replace an improper prior by the proper posterior distribution 
$g_i(\theta_i \mid y)$, $i = 1, 2$. However, one does not know how to divide the data x into y and z. 


Partitioning Paradox: Let $\Theta = \{\theta_1, \theta_2\}$, where $\theta_1$ denotes the event that there is life in the orbit about 
the star Sirius and $\theta_2$ denotes the event that there is not. Laplace’s principle of insufficient 
reason gives probability $P(\{\theta_1\}) = P(\{\theta_2\}) = 1/2$. Consider $\Omega = \{\omega_1, \omega_2, \omega_3\}$, where $\omega_1$ denotes 
the event that there is life around the star Sirius, $\omega_2$ denotes the event that there are planets 
but no life and $\omega_3$ denotes the event that there are no planets. Then Laplace’s rule gives 
$P(\{\omega_1\}) = P(\{\omega_2\}) = P(\{\omega_3\}) = 1/3$. The paradox is that the probability of life in the orbit about 
Sirius is $P(\{\theta_1\}) = 1/2$ if we adopt the first formulation but is $P(\{\omega_1\}) = 1/3$ if we adopt the 
second formulation. Therefore, Laplace’s principle of insufficient reason, leading to the use of 
a uniform prior that assigns equal probability to each point in the parameter space, may be 
inconsistent if it is applied to all coarsenings and refinements of the parameter space 
simultaneously. 

Personal Probability: An approach to allocate probabilities to events. In this approach probability 
represents a degree of belief in a proposition based on all the information. Thus, two people 
with different information and different subjective ignorance may assign different probabilities 
to the same proposition. The only constraint is that a single person’s probabilities should not 
be inconsistent. 

Personalist Probabilist: Subjective probabilists argue that since knowledge varies from individual to 
individual, the quantitative measure of knowledge must vary from individual to individual also. 
Any individual may, therefore, constrain his degrees of belief to obey the probability axioms 
but is otherwise free to assign them as he sees fit. Examples are B. de Finetti, F.P. Ramsey and L.J. 
Savage. 

Point Prediction: It is a measure of central tendency of the predictive pdf $f(y \mid x)$. In the decision theoretic 
framework, let $L(y, \hat{y})$ be the loss function where $\hat{y}$ is a prediction for y. The optimal point 
prediction $\hat{y}$ is obtained by minimising the predictive expected loss with respect to $\hat{y}$, provided 
it exists. In other words, the optimal point prediction $\hat{y}$ is a solution to 

$$\min_{\hat{y}} \int L(y, \hat{y})\,f(y \mid x)\,dy,$$

provided it exists. 

Posterior Bayes Factor: Suppose $\ell_j(\theta_j)$ $(= f_j(x \mid \theta_j))$ is the likelihood function of the model $M_j$ 
parametrized by $\theta_j$ ($j = 1, 2$) for a given data x. 
Aitkin (1991) defined the Posterior Bayes Factor as the ratio of posterior means 
$E(\ell_1(\theta_1) \mid x)/E(\ell_2(\theta_2) \mid x)$ and denoted it by $A_{12}$. Values of $A_{12}$ less than 0.05, 0.01, or 
0.001 constitute strong, very strong, and overwhelming sample evidence against $M_1$ in favour 
of $M_2$. 




Posterior Distribution: It is the conditional density or probability mass function for the parameter, 
given the values of a random sample. See Bayes theorem for random variables. 

Posterior Odds: See Bayes rule. 

Posterior Robustness: An inference or decision is posterior robust with respect to the class of priors 
G if it is satisfactory with respect to the posterior distribution for all priors in G. 

Postulate: A principle or statement in a scientific theory, which is taken as initial proposition, and is 
incapable of proof within the framework of that theory. 

An axiom signifies an initial logical principle of a theory, whereas a postulate is an initial scientific 
proposition in the theory. 

Preposterior Analysis: It refers to the decision maker’s attempt to look forward, before he gathers the 
data, to the situation in which he will find himself after the data are gathered. In particular, 
he may be interested in deciding the optimal fixed sample size so that the expected loss is not 
larger than some preassigned acceptable value. 

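Suppose $L(\theta, a; n)$ is the loss in observing the sample X of size n and taking an action a, and 
that the prior density of $\theta$ is $g(\theta)$. Let $\delta_n^{*}$ denote the Bayes decision rule for the loss $L(\theta, a)$, 
having the Bayes risk r(g, n), and let c(n) be the cost function. The optimal sample size is that value 
of n that minimizes T(n) = r(g, n) + c(n), where 

$$r(g, n) = E[E(L(\theta, \delta_n^{*}(x); n) \mid \theta)] = \int\!\!\int L(\theta, \delta_n^{*}(x); n)\,f(x \mid \theta)\,g(\theta)\,dx\,d\theta = \int\!\!\int L(\theta, \delta_n^{*}(x); n)\,g(\theta \mid x)\,d\theta\,m(x)\,dx,$$

with m(x) denoting the marginal density of x. 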
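
The choice of sample size can be sketched for one simple case. The setting below is an assumption made for illustration and is not taken from the text: observations are $N(\theta, \sigma^2)$ with $\sigma^2$ known, the prior is $N(\mu_0, \tau^2)$, the loss is squared error, and the cost is linear, c(n) = c·n. In that case the Bayes risk is the posterior variance, $r(g, n) = 1/(1/\tau^2 + n/\sigma^2)$. All function names are hypothetical.

```python
def bayes_risk(n, sigma2, tau2):
    """Bayes risk under squared error loss for a normal mean with known
    variance sigma2 and a normal prior with variance tau2."""
    return 1.0 / (1.0 / tau2 + n / sigma2)


def total_cost(n, sigma2, tau2, unit_cost):
    """T(n) = r(g, n) + c(n) with an assumed linear cost c(n) = unit_cost * n."""
    return bayes_risk(n, sigma2, tau2) + unit_cost * n


def optimal_sample_size(sigma2, tau2, unit_cost, n_max=100000):
    """Search n = 0, 1, ..., n_max for the value minimising T(n)."""
    best_n = min(range(n_max + 1),
                 key=lambda n: total_cost(n, sigma2, tau2, unit_cost))
    return best_n, total_cost(best_n, sigma2, tau2, unit_cost)


# Hypothetical inputs: sigma2 = tau2 = 1 and a cost of 0.0001 per observation.
n_star, t_star = optimal_sample_size(1.0, 1.0, 1e-4)
print(n_star, t_star)
```

Treating n as continuous in this special case gives $n^{*} = \sigma/\sqrt{c} - \sigma^2/\tau^2$, which the integer search reproduces for these inputs.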


Predictive Distribution (Bayesian): It is the pdf (or pmf) of the as yet unobserved observation y, given 
sample information x. Let us write $f(y, \theta \mid x) = f(y \mid \theta, x)\,g(\theta \mid x)$ as the joint pdf of y and the 
parameter $\theta$, given the sample information x. Here, $f(y \mid \theta, x)$ is the conditional pdf for y, given 
$\theta$ and x, whereas $g(\theta \mid x)$ is the conditional pdf for $\theta$ given x. The predictive pdf $f(y \mid x)$ is obtained 
as 

$$f(y \mid x) = \int f(y, \theta \mid x)\,d\theta = \int f(y \mid \theta, x)\,g(\theta \mid x)\,d\theta.$$

In case the unobserved observation y is independent of the sample information x given $\theta$, that is, y and 
x have independent conditional pdfs, then $f(y \mid x) = \int f(y \mid \theta)\,g(\theta \mid x)\,d\theta$. 
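
The integral has a closed form in conjugate models. The sketch below is an illustration with hypothetical function names: x successes are observed in n Bernoulli($\theta$) trials, the prior is Beta(a, b), and y is the number of successes in m further trials, independent of x given $\theta$. The predictive pmf is then $\binom{m}{y} B(a + x + y,\, b + n - x + m - y)/B(a + x,\, b + n - x)$.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def predictive_pmf(y, m, x, n, a=1.0, b=1.0):
    """P(y successes in m future trials | x successes in n trials) under a
    Beta(a, b) prior: the binomial pmf integrated over the Beta posterior."""
    a_post = a + x
    b_post = b + n - x
    log_ratio = log_beta(a_post + y, b_post + m - y) - log_beta(a_post, b_post)
    return math.comb(m, y) * math.exp(log_ratio)


# Hypothetical data: 7 successes in 10 trials, uniform prior, 5 future trials.
pmf = [predictive_pmf(y, 5, 7, 10) for y in range(6)]
print(pmf, sum(pmf))
```

With a uniform prior and a single future trial (m = 1) the probability of a success reduces to (x + 1)/(n + 2).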


Predictive Inference: Inference about unobserved data. 

Premises: Propositions from which a new proposition or inference is drawn. 

Price, Rev. Richard (1723-1791): He was a dissenting minister, mathematician and political economist. 
His actuarial work led to his election as Fellow of the Royal Society in 1765, and the honorary 
degree of DD was awarded to him on the 7th August 1767 by Marischal College, Aberdeen; 
an LLD from Yale followed in 1781. He was the author of the Northampton Life Table and the object of 
Burke’s oratorical invective in ‘Reflections on the Revolution in France’. 

After the death of Rev. Thomas Bayes, his books and papers were handed over to Price, 
who later edited Bayes’ works and communicated them to John Canton for publication in the 
Philosophical Transactions. He contributed an introduction and an appendix, adding 
to the “Essay” in non-trivial ways. 
In 1767, Price published a volume entitled ‘Four Dissertations’ in which the fourth essay 
concerned inverse probability. In this ‘Essay’ he quoted examples illustrating the results 
given in Appendix 2 of Bayes’ Essay. 

Principle: The guiding idea, the basic rule of behaviour. 

Principle of Coherence: The principle of coherence of probabilities states that your assignment of 
probabilities to all possible events should be such that it is not possible to make a definite 
gain by betting with you. 

Principle of Cogent Reason (Principle of Indifference): It is an appeal to the symmetry or homogeneity 
of the experimental situation. 

Principle of Insufficient Reason: It is a subjective concept and a quantitative description of the state 
of being ignorant. Thus if the coin appears essentially symmetric, we appeal to the principle 
of cogent reason and if it does not appear asymmetric we are applying the principle of 
insufficient reason. 

In other words, if a coin is physically symmetric, why should a head or a tail be favoured? 
We must accept that the outcomes are equally likely. On the other hand, if we have no reason 
to believe that one or another face is more likely to arise, then we should act as if both 
are equally likely. The former situation explains the principle of cogent reason, whereas the latter 
explains the principle of insufficient reason. Two conditions must be satisfied before the principle 
of insufficient reason can be used to assign numerical values of probabilities. 

(i) The situation must be analysed into mutually exclusive and exhaustive possibilities. 

(ii) Having done this, we must then find that the available information gives us no reason to prefer 
any of the possibilities to any other. This condition may be satisfied as a result of ignorance, or it might 
be satisfied as a result of positive knowledge of the situation. 

Principle of Parsimony: It states that unless there is a very good reason to do otherwise, the more 
parsimonious model should be used. One model is more parsimonious than another if it can be 
completely specified using a smaller number of parameters. 

Principle of Precise Measurement: (L.J. Savage, 1962) This is the kind of measurement we have, when 
the data are so incisive as to overwhelm the initial opinion. In Bayesian framework, it may be 
interpreted as follows: 

If the prior distribution of the parameter $\theta$ is more or less constant over that range of $\theta$ for 
which the likelihood function is appreciable, and not too large over that range of $\theta$ for which 
the likelihood function is small, then the posterior distribution is approximately equal to the 
(normalized) likelihood function and the prior does not have much of an effect on the posterior 
distribution. 

Elsewhere, Savage describes this principle as one of stable estimation. Historically, Jerzy 
Neyman attributes it to Bernstein and independently to Von Mises, both during the period 
1915-1920. 

Principle of Separation of Prior Information and Current Data: It suggests that when assessing prior 
probabilities, use only that information which is not included in the likelihood. 

Prior Information: It represents all that is known or assumed about the parameter $\theta$ before, or other 
than, the observation of empirical data. 

Prior Distribution: Density or probability mass function expressing prior belief about the value of the 
parameter. See Bayes theorem for random variables. 

Prior Odds: See Bayes rule. 




Prior Predictive Distribution: Let $f(x \mid \theta)$ be the conditional distribution of x, given the parameter value 
$\theta$, and $g(\theta)$ be the prior distribution of $\theta$. The distribution of the unknown but observable x, 

$$f(x) = \int f(x, \theta)\,d\theta = \int f(x \mid \theta)\,g(\theta)\,d\theta,$$

the marginal distribution of x, is known as the prior predictive distribution. It is prior since it is 
not conditional on a previous observation of the process and predictive because it is the 
distribution of a quantity that is observable. 

Proper Prior: A proper prior is one that allocates positive weights that total one to the possible values 
of the parameter. It satisfies the definition of probability mass function (or pdf). 

Public Policy Prior: All observers adopt the same prior as representing a prior state of indifference 
to one parameter value over another. It is also known as objective prior. 

Public Probabilities: These are based on a given relative frequency and, therefore, a person will 
constrain his probability to be consistent with the publicly known frequency. 

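Rationality Axioms: Let $\mathcal{P}$ be the set of probability distributions. 
Axiom 1: If $P_1$ and $P_2$ are in $\mathcal{P}$ then either $P_1 \prec P_2$, $P_1 \approx P_2$ or $P_2 \prec P_1$. 

Axiom 2: If $P_1 \preceq P_2$ and $P_2 \preceq P_3$ then $P_1 \preceq P_3$. 

Axiom 3: If $P_1 \prec P_2$ then $\alpha P_1 + (1-\alpha)P \prec \alpha P_2 + (1-\alpha)P$ for any $0 < \alpha < 1$ and $P \in \mathcal{P}$. 

Axiom 4: If $P_1 \prec P_2 \prec P_3$, there are numbers $0 < \alpha < 1$ and $0 < \beta < 1$ such that 
$\alpha P_1 + (1-\alpha)P_3 \prec P_2$ and $P_2 \prec \beta P_1 + (1-\beta)P_3$. 

Note that $P_1 \prec P_2$ means that $P_2$ is preferred to $P_1$, 
$P_1 \approx P_2$ means that $P_1$ is equivalent to $P_2$, and 
$P_1 \preceq P_2$ means that $P_1$ is not preferred to $P_2$. 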

Reference Prior: It is a prior which is convenient to use as a standard and is dominated by the 
likelihood. According to Kass and Wasserman (1996), reference priors are an expression of 
ignorance or as a socially agreed upon standard alternative to subjective priors. Some authors 
refer to them as automatic priors or default priors. Bernardo (1979) proposed reference prior 
distribution based on entropy information measure to overcome the problems encountered in 
using Jeffreys’ invariant non-informative prior. 

Renyi, Alfred (1921-1970): He was born in Budapest, Hungary. He contributed to probability theory, 
random graphs and information theory. 

Renyi’s Axiom System: It is a countable additivity axiom system based upon conditioning. Suppose 
$\mathcal{B}$ denotes a non-empty system of sets such that $\mathcal{B}$ is contained in a $\sigma$-algebra $\mathcal{A}$ of 
subsets of the sample space of the random experiment. The conditional probability $P(A \mid B)$ is 
defined for $A \in \mathcal{A}$, $B \in \mathcal{B}$ if 
(i) for any events A and B, $P(A \mid B) \geq 0$ and $P(B \mid B) = 1$, 

(ii) for mutually exclusive events $A_1, A_2, \ldots$ and some event B, 

$$P\left(\bigcup_{i=1}^{\infty} A_i \,\Big|\, B\right) = \sum_{i=1}^{\infty} P(A_i \mid B),$$

(iii) for every event collection (A, B, C), $B \subset C$, $P(B \mid C) > 0$, we have 

$$P(A \mid B) = \frac{P(A \cap B \mid C)}{P(B \mid C)}.$$

The countable additivity axiom permits us to have continuity and asymptotic theory. 

Risk Function: The risk function of a decision function $\delta(x)$ is the expected loss incurred in using $\delta(x)$. 
Mathematically, the risk function is $R(\theta, \delta) = E\,L(\theta, \delta(x))$, where the expectation is taken with 
respect to $f(x \mid \theta)$. 

Robbins, Herbert (1915-2001): He was Higgins Professor of Mathematical Statistics at Columbia 
University, USA. His important contributions are empirical Bayes methods and stochastic 
approximation. He is also known for the very popular book ‘What is Mathematics?’. 

Robust Bayesian Analysis: It is a study of the sensitivity of posterior quantities of interest with respect 
to some or all of the uncertain inputs like the prior distribution, the likelihood function, and the 
loss (utility) function. 

Robustness: It means lack of sensitivity of the conclusions to reasonable violations of the assumptions 
involving choice of the model, prior, and/or the loss function. 

Sample Space: The set of possible outcomes of a statistical investigation with n observations, performed to obtain 
information about $\theta$, is known as the sample space. Usually the sample space S will be a subset of 
n-dimensional Euclidean space. 

Savage, Leonard James (1917-1971): He was born in Detroit, USA, and is well-known for his book 
‘The Foundations of Statistics’ published in 1954 and also for the principle of stable estimation 
(or principle of precise measurement) and axioms of rationality. 

Schwarz’s Criterion: See Bayes information criterion. 

Sensitivity Analysis: It is the process of examining changes in the conclusions caused by changes in 
the initial assumptions. 

Shannon’s Measure of Entropy (Uncertainty): Let $p = (p_1, p_2, \ldots, p_n)$ be a probability distribution such 
that $p_i \geq 0$, $i = 1, 2, \ldots, n$, and $\sum_{i=1}^{n} p_i = 1$. Shannon’s measure of entropy (uncertainty) for p is 
given by $S_n(p) = -\sum_{i=1}^{n} p_i \log p_i$. 
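
A minimal sketch of the entropy computation follows. The function name `shannon_entropy` is hypothetical, natural logarithms are assumed, and terms with $p_i = 0$ are taken to contribute zero.

```python
import math


def shannon_entropy(p):
    """Return -sum_i p_i * log(p_i) for a discrete distribution p,
    using natural logarithms and the convention 0 * log(0) = 0."""
    if any(pi < 0.0 for pi in p):
        raise ValueError("probabilities must be non-negative")
    if abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to one")
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])
skewed = shannon_entropy([0.7, 0.1, 0.1, 0.1])
certain = shannon_entropy([1.0, 0.0, 0.0, 0.0])
print(uniform, skewed, certain)
```

Among distributions on n points the uniform distribution has the largest entropy, log n, and a degenerate distribution has entropy zero.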

Simulation: The artificial generation of a random process to imitate the behaviour of a particular 
statistical model. 

St. Petersburg Paradox: Suppose that a person is to be given the opportunity of playing the following 
game: A fair coin is to be tossed repeatedly until a head is obtained for the first time. If the 
first head is obtained on the nth toss then his reward will be $2^n$ rupees (n = 1, 2, ...). The 
expected monetary gain will be $\sum_{n=1}^{\infty} 2^n (1/2)^n = \infty$. If his utility function is linear then he should 
be willing to pay any arbitrarily large amount for the opportunity of playing the game. However, 
in fact, each person will like to pay only a specific finite amount which depends on his utility 
function. This fact is called the St. Petersburg paradox. 

The paradox can be resolved if we define the utility function U for a change in fortune, that 
is, U(0) = 0. Then, if it costs c rupees to play the game, the true value of playing is 
$\sum_{n=1}^{\infty} U(2^n - c)\,2^{-n}$. Thus one should play the game if the above expected utility is positive, 
otherwise not. 
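
Both sums can be explored numerically. In the sketch below the utility of a change d in fortune is assumed to be U(d) = log((w + d)/w) for a player with current wealth w, so that U(0) = 0. This logarithmic form, the wealth figure and the function names are illustrative assumptions and are not taken from the text.

```python
import math


def expected_gain_partial(n_terms):
    """Partial sum of 2^n * (1/2)^n for n = 1..n_terms; each term equals 1,
    so the partial sums grow without bound."""
    return sum((2 ** n) * (0.5 ** n) for n in range(1, n_terms + 1))


def expected_utility(cost, wealth, n_terms=200):
    """Truncated sum of U(2^n - cost) * 2^(-n) with the assumed utility
    U(d) = log((wealth + d) / wealth). Requires wealth + 2 - cost > 0."""
    total = 0.0
    for n in range(1, n_terms + 1):
        change = 2 ** n - cost
        total += math.log((wealth + change) / wealth) * 0.5 ** n
    return total


print(expected_gain_partial(50))          # grows linearly with the number of terms
print(expected_utility(cost=2, wealth=100))
print(expected_utility(cost=50, wealth=100))
```

Under this assumed utility the expected utility is positive for a small entry cost and negative for a large one, so the player is willing to pay only a finite amount.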

Statistical Prediction: It is a statistical inferential problem whose solution depends on our envisaging 
some future occurrence. 

Stigler’s Law of Eponymy (1980): No scientific discovery is named after its original discoverer. For 
example, $f(x) = \exp(-x^2/2)$ was used by Laplace in 1774, 3 years before Gauss was born, and 
by A. de Moivre in 1733, 16 years before Laplace was born. 

Sufficiency Principle: Two observations x and y factorizing through the same value of a sufficient 
statistic T, that is, such that T(x) = T(y), must lead to the same inference on the parameter $\theta$. 

Sufficient Statistic: A function T of a sample x from the population having pdf (pmf) $f(x \mid \theta)$ is said to 
be sufficient if the distribution of X, conditional upon T(x), does not depend on $\theta$. 

Sufficient Statistic (Bayesian): For any prior pdf (or pmf) $g(\theta)$ and any observed value x in S, let $g(\theta \mid x)$ 
denote the posterior pdf (or pmf) of $\theta$. Then a statistic T is a sufficient statistic for the family 
of pdfs (or pmfs) $\{f(x \mid \theta);\ \theta \in \Theta\}$ if $g(\theta \mid x_1) = g(\theta \mid x_2)$ for any prior pdf (or pmf) g and any two 
points $x_1, x_2 \in S$ such that $T(x_1) = T(x_2)$. 

In other words, a statistic T is called a sufficient statistic if, for any prior distribution of $\theta$, 
its posterior distribution depends on the observed value of x only through T(x). 

Sure-thing Principle: A distribution $P \in \mathcal{P}$ is defined to be bounded if there exist rewards $r_1$ and $r_2$ 
such that $r_1$ is not preferred to $r_2$ and $P[r_1, r_2] = 1$. Consider bounded probability distributions 
$P_1, P_2 \in \mathcal{P}$ and any probability $\alpha$. Let $\alpha P_1 + (1-\alpha)P_2$ denote the unconditional distribution which 
refers to $P_1$ with probability $\alpha$ and to $P_2$ with probability $1-\alpha$. Then for any other bounded 
probability distribution $P \in \mathcal{P}$ and any real number $\alpha$ belonging to (0, 1), $P_1$ is not preferred 
to $P_2$ is equivalent to $\alpha P_1 + (1-\alpha)P$ is not preferred to $\alpha P_2 + (1-\alpha)P$. 

According to Savage (1972), if your preferences do not satisfy the expected utility hypothesis, 
it would be possible to place repeated bets with you which, if accepted, would 
cause you to be a sure loser (lose on average) in the long run. 

Vague Prior: Prior information becomes vague when the sample information is so incisive that the 
posterior distribution is approximately the same as the likelihood function. A prior distribution 
resulting in a posterior distribution approximately equal to the (normalised) likelihood function is called a vague 
prior. This is certainly true when the prior distribution is the uniform distribution. 

Wald, Abraham (1902-1950): He was born in Cluj, Hungary and emigrated to the USA in 1938. His 
important contributions are in statistical decision theory and sequential analysis. 

Standard Distributions and their 

Name and Notation 

1. Bernoulli: Bernoulli($\theta$, r) 
2. Discrete Uniform 
3. Binomial: Bin(n, $\theta$) 
4. Negative Binomial: NBin($\theta$, r) 
5. Pascal: Pascal($\theta$, r) 
6. Poisson: Pois($\theta$) 
7. Hypergeometric: Hy(A, B, n) 
8. Uniform: U(a, b) 
9. Normal: N($\theta$, $\sigma^2$) 
10. Generalised Inverse Normal: GIN($\alpha$, $\mu$, $\tau$) 
11. Lognormal: lognormal($\theta$, $\sigma^2$) 
12. Inverse Gaussian: Inverse Gaussian($\mu$, $\lambda$) 
13. Gamma: Gamma(a, b) 
14. Chi-square with n df 
15. Inverted Gamma: Inverted-Gamma(a, b) 
16. Inverted-$\chi^2$ with n df 
17. Beta: Beta(a, b) 
18. Inverted Beta: Inverted-Beta(a, b) 
28. Inverted-Wishart: Inverted-Wishart(n, G) 
29. Multivariate-t 
30. Multinomial: Multinomial(n, $\theta$) 
31. Dirichlet 

Table-II 


Natural Conjugate Prior for some Standard Distributions 

Distribution: Natural Conjugate Prior 

Bin(n, $\theta$), n known: Beta($a_1$, $a_2$) 
NBin(r, $\theta$), r known: Beta($a_1$, $a_2$) 
Pois($\theta$): Gamma($a_1$, $a_2$) 
Hypergeometric(a, b, c): Bin($a_1$, $a_2$) 
U(0, $\theta$) 
N($\theta$, $\sigma^2$), $\sigma^2$ known 
N($\theta$, $\sigma^2$), $\theta$ known 
N($\theta$, r), $\theta$ known 
N($\theta$, $\sigma^2$) 
N($\theta$, r) 
Pareto(k, $\theta$), k known 
Gamma(m, $\theta$), m known 
Inverted-Gamma(m, $\theta$), m known 
MVN($\theta$, r), precision matrix r known 
MVN($\theta$, r), $\theta$ known 
MVN(0,0r), precision matrix r known 
MVN($\theta$, r) 
Multinomial(n, $\theta$): Dirichlet 

Table-III 


Jeffreys’ Prior 

Distribution: Jeffreys’ Prior 

Bernoulli($\theta$): $(\theta(1-\theta))^{-1/2}$ 
NBin(1, $\theta$): $\theta^{-1}(1-\theta)^{-1/2}$ 
Pois($\theta$) 
U(0, $\theta$) 
N($\theta$, $\sigma^2$) 
N($\mu$, $\sigma^2$) 
N($\theta$, $\sigma^2$) 
Weibull(p, $\theta$) 

Loss Functions and corresponding Bayes Estimates 


Loss Function, $L(\theta, a)$, Bayes Estimate 

1. Squared Error 
10. Generalised Entropy 
11. Weighted Entropy 
12. Balanced Loss 
13. Weighted BLF 
14. Quasi-Quadratic 
15. Reflected Normal 
16. Squared Logarithmic 